[NA/APPS] - Elevated application errors
Incident Report for SashiDo.io
Postmortem

Summary

On July 10, 2023, our infrastructure experienced a significant incident caused by the failure of multiple SSDs in our RAID arrays. We replaced the hardware promptly and started rebuilding to restore functionality. During recovery, specific applications experienced intermittent outages because of synchronization issues with network rules and the routing mesh. Rebuilding indexes and caches for services such as uCDN and the file service took longer than expected due to synchronization requirements. We have requested new hardware from our data center provider to ensure long-term service quality.

Root Cause Analysis

The problem was hard to detect at first because the faulty SSDs were functioning intermittently, causing random instances of severe slowness. This made it difficult for automated systems to detect and address the issue promptly. Once we identified the faulty hardware, we replaced it and began rebuilding. Synchronization issues with networking rules and the routing mesh also contributed to the intermittent outages.

Mitigation and Future Preparations

To improve our detection and response capabilities, we will implement the following measures:

  1. Enhanced monitoring for early detection: We will augment our monitoring systems to better identify and respond to hardware-related issues, even when they present as intermittent problems.
  2. Proactive incident response: We will refine our incident response procedures to ensure swift action in the event of hardware failures, prioritizing rapid detection and resolution.
  3. Strengthened synchronization processes: We will review and optimize our synchronization procedures to minimize downtime and expedite the recovery of services requiring index and cache rebuilding.
  4. Continual improvement: We will leverage the lessons learned from this incident to refine our infrastructure, monitoring, and response mechanisms, further enhancing our ability to mitigate similar incidents in the future.
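
As an illustration of the first measure, the sketch below shows one way intermittent slowness could be flagged even when most samples look healthy: it counts the share of slow samples in a rolling window instead of relying on an average. This is a hypothetical example, not a description of our monitoring stack; the class name, window size, thresholds, and sample latencies are all assumptions chosen for illustration.

```python
from collections import deque


class IntermittentSlownessDetector:
    """Flags a device when too many recent samples are slow, even if most are fine."""

    def __init__(self, window=60, slow_ms=50.0, max_slow_fraction=0.1):
        # All defaults are illustrative assumptions, not tuned values.
        self.samples = deque(maxlen=window)  # most recent latency samples (ms)
        self.slow_ms = slow_ms               # a sample above this counts as slow
        self.max_slow_fraction = max_slow_fraction

    def observe(self, latency_ms):
        """Record one sample; return True if the full window looks degraded."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to judge yet
        slow = sum(1 for s in self.samples if s > self.slow_ms)
        return slow / len(self.samples) > self.max_slow_fraction


# Hypothetical I/O latencies (ms): mostly fast, with occasional severe stalls.
detector = IntermittentSlownessDetector(window=10, slow_ms=50.0, max_slow_fraction=0.2)
readings = [2, 3, 2, 400, 2, 3, 350, 2, 2, 500]
alerts = [detector.observe(r) for r in readings]
print(alerts[-1])
```

In this made-up series the typical reading stays at 2 to 3 ms, so a check based on the median would report a healthy disk, while the slow-sample fraction (3 of 10) exceeds the 20% limit and raises an alert once the window is full.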

Conclusion

We apologize for the inconvenience caused by this infrastructure incident. Although the intermittent behavior of the faulty SSDs made the problem difficult to detect at first, we promptly replaced the hardware and initiated the necessary rebuilds once it was identified. We have requested new hardware from our data center provider to ensure long-term service quality.

We remain committed to enhancing our capabilities and response mechanisms to prevent and address incidents effectively. We appreciate your understanding during this time. If you have any questions or concerns, please get in touch with our support team.

Posted Jul 11, 2023 - 15:43 UTC

Resolved
Good news! Things should be back to normal again across the board. Please let us know if there is any further trouble. Thank you so much for your patience while we worked to sort this out.
Posted Jul 10, 2023 - 18:15 UTC
Monitoring
All applications have returned to a healthy state. We will continue to monitor the situation.
Posted Jul 10, 2023 - 17:45 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 10, 2023 - 11:56 UTC
Investigating
We are currently investigating this issue.
Posted Jul 10, 2023 - 04:56 UTC
This incident affected: Application Platform (North America).