Summary
On 10.07.2023, our infrastructure experienced a significant incident caused by the failure of multiple SSDs in our RAID arrays. We replaced the failed hardware promptly and began rebuilding the arrays to restore functionality. However, specific applications suffered intermittent outages because of synchronization issues with network rules and the routing mesh. Rebuilding indexes and caches for services such as uCDN and the file service took longer than expected because of synchronization requirements. We have requested new hardware from our data center provider to ensure long-term service quality.
Root Cause Analysis
The problem was difficult to detect at first because the faulty SSDs functioned intermittently, causing random periods of severe slowness instead of failing outright. As a result, our automated systems did not detect and address the issue promptly. Once the faulty hardware was identified, we replaced it and began rebuilding. Synchronization issues with network rules and the routing mesh also contributed to the intermittent outages.
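As an illustration only, the detection difficulty described above can be sketched in Python. This is not the monitoring used in our infrastructure: the class name, the thresholds, and the simulated latency data are all hypothetical. The sketch shows why a device that stalls only occasionally can look healthy to a check based on typical latency, while a check on the share of slow requests in a sliding window flags it.

```python
from collections import deque
from statistics import median


class IntermittentSlownessDetector:
    """Flags a device whose latency is usually normal but sometimes very slow."""

    def __init__(self, window=100, slow_ms=200.0, max_slow_fraction=0.05):
        # All three values are illustrative assumptions, not tuned settings.
        self.samples = deque(maxlen=window)
        self.slow_ms = slow_ms
        self.max_slow_fraction = max_slow_fraction

    def record(self, latency_ms):
        """Store one latency sample in milliseconds."""
        self.samples.append(latency_ms)

    def slow_fraction(self):
        """Share of samples in the window at or above the slow threshold."""
        if not self.samples:
            return 0.0
        slow = sum(1 for s in self.samples if s >= self.slow_ms)
        return slow / len(self.samples)

    def is_suspect(self):
        """True when slow samples are too frequent, even if the median looks healthy."""
        return self.slow_fraction() > self.max_slow_fraction


def simulated_latencies():
    """Mostly fast I/O with a severe stall on every tenth request (made-up data)."""
    return [900.0 if i % 10 == 9 else 2.0 for i in range(100)]


detector = IntermittentSlownessDetector()
for latency in simulated_latencies():
    detector.record(latency)

median_ms = median(detector.samples)
suspect = detector.is_suspect()
print(f"median={median_ms} ms, slow_fraction={detector.slow_fraction():.2f}, suspect={suspect}")
```

In the simulated data the median latency stays low because nine out of ten requests are fast, so a median-based check would stay quiet, while the slow-request fraction exceeds the assumed limit. Real thresholds would have to be derived from the actual workload.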
Mitigation and Future Preparations
To improve our response and detection capabilities, we will implement the following measures:
Conclusion
We apologize for the inconvenience caused by this incident. Although the intermittent behavior of the faulty SSDs made initial detection difficult, we replaced the hardware promptly once the problem was identified and initiated the necessary rebuilds. We have requested new hardware from our data center provider to ensure long-term service quality.
We remain committed to enhancing our capabilities and response mechanisms to prevent and address incidents effectively. We appreciate your understanding during this time. If you have any questions or concerns, please get in touch with our support team.