On the evening of July 11th, 2024, fireTMS experienced a significant outage due to an incident with our cloud provider OVH, which led to the unexpected shutdown of our main database. The main database is critical for our operations, handling all user data and transactions. This report outlines the timeline of events, our response, and the steps taken to restore full functionality to our services.
Replica server – a copy of the main database server that stores the same data and is synchronised with the main server in real time. The machine sits in a different location from the main database (a sketch of a lag check follows these definitions).
Virtualisation – a technology that lets multiple virtual machines run on a single physical server, enabling better resource utilisation, flexibility, and isolation of environments.
Worker server – a server that processes tasks in the background, handling various events and external operations to offload the main application.
RAID – a technology that combines multiple disks into a single logical unit, increasing performance and providing protection against disk failure.
UI server – a server that manages user interactions with the application interface, processing requests and providing real-time responses.
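As an illustration of the real-time synchronisation mentioned above, here is a minimal sketch of a replica-lag check. It assumes PostgreSQL streaming replication, which the report does not confirm; the function name and connection string are hypothetical.

```python
# Minimal sketch of a replica-lag check, assuming PostgreSQL streaming
# replication (the report does not name the database engine used).
import psycopg2

def replica_lag_seconds(dsn: str) -> float:
    """Approximate replay lag on a standby, assuming synced clocks
    and steady write traffic on the primary."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        (lag,) = cur.fetchone()
        return float(lag or 0.0)

if __name__ == "__main__":
    # Hypothetical DSN; point it at the standby, not the primary.
    print(replica_lag_seconds("host=replica.example.com dbname=app user=monitor"))
```

Keeping this number near zero is what makes an emergency switchover, like the one described in the timeline below, safe to perform.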
– 21:59:15: OVH unexpectedly powered off the fireTMS main database server.
– 07:28: Employees discovered the database server was down and attempted to access it via IPMI through the OVH panel. The server status was “Powered Off.”
– A quick check of the OVH status page showed that our server was flagged with an “Incident”, with only vague information provided.
– 07:37: We switched to our replica database located in a different datacenter in another country.
– 07:57: UI servers were restarted after switching to the replica database.
– 08:12: Degraded application performance was noted due to insufficient CPU resources on the VM instance. The worker server was also underperforming, prompting us to plan a switch to a backup datacenter.
– 22:20: Successfully switched the worker server to the backup datacenter, resulting in improved event-consumption performance.
– Decision: Opted to purchase two new UI servers in the backup datacenter with a 24-hour delivery time, as the delivery time for a server comparable to our main database server was over 72 hours.
– 19:08: OVH powered on our main database server.
– 10:00: Began preparations to switch to new servers in a backup datacenter and checked the status of the main database server.
– 10:14: Discovered that RAID partitions were missing. Diagnostics confirmed the issue, leading to a support ticket asking OVH to replace the disks (a sketch of this kind of check follows the timeline).
– Procedure: We waited for OVH to replace the faulty disks before resynchronising, as the server could go offline at any moment and interrupt synchronisation.
– 17:54: Redirected all traffic to the backup datacenter servers, and initial performance tests were positive.
– 08:19: Noted degraded performance of fireTMS.
– 09:02: Server tweaks and restarts were performed to resolve performance issues, but disk I/O exhaustion was identified as the root cause.
– 13:26: Received an email from OVH scheduling disk replacement.
– 21:20: OVH started their work on disk replacement.
– Morning: Our team noticed the disk replacement had been completed and began rebuilding the main database. Because read/write capacity was still limited, the team postponed the work until 3:00 p.m.
– 15:00: Resynchronisation of the database began and continued until 21:22 (a sketch of monitoring this catch-up follows the summary).
– 22:34: The first UI server in our main datacenter was brought online, and user traffic was directed to it.
– 21:55: All infrastructure was restored to the pre-incident state.
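For context on the missing RAID partitions discovered at 10:14, the sketch below shows the kind of check that surfaces a degraded array. It assumes Linux software RAID (mdadm), where /proc/mdstat reports member health; the report does not say whether the server used software or hardware RAID, and a hardware controller would require a vendor CLI instead.

```python
# Sketch of a Linux software-RAID (mdadm) health check via /proc/mdstat.
# Assumption only: the report does not state the server's RAID setup.
import re

def degraded_md_arrays(mdstat_path: str = "/proc/mdstat") -> list[str]:
    """Return names of md arrays whose status line shows missing disks."""
    with open(mdstat_path) as f:
        text = f.read()
    degraded = []
    # Each array block starts with e.g. "md0 : active raid1 sda1[0] sdb1[1]"
    # and contains a status like "[2/2] [UU]"; "_" marks a missing member.
    for name, block in re.findall(r"^(md\d+) : (.*?)(?=^md\d+ :|\Z)",
                                  text, re.S | re.M):
        status = re.search(r"\[\d+/\d+\] \[([U_]+)\]", block)
        if status and "_" in status.group(1):
            degraded.append(name)
    return degraded

if __name__ == "__main__":
    print(degraded_md_arrays() or "all arrays healthy")
```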
The incident began with an unexpected shutdown of our main database server by OVH, followed by a series of events that required quick decision-making and adaptation. Our response involved switching to the replica database, purchasing new servers, and waiting for OVH to replace the faulty disks. The performance issues were mainly due to disk I/O exhaustion on the replica server.
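Disk I/O exhaustion shows up as devices spending nearly all of their time busy. Purely as an illustration (the report does not describe fireTMS's actual tooling), the sketch below estimates per-device utilisation, similar to iostat's %util, from psutil's Linux-specific busy_time counters.

```python
# Illustrative sketch: estimating per-disk utilisation from psutil's
# busy_time counter (Linux-specific). Not the tooling fireTMS used.
import time
import psutil

def disk_utilisation(interval: float = 5.0) -> dict[str, float]:
    """Percent of `interval` that each disk spent busy."""
    before = psutil.disk_io_counters(perdisk=True)
    time.sleep(interval)
    after = psutil.disk_io_counters(perdisk=True)
    util = {}
    for dev, b in before.items():
        if dev in after:
            busy_ms = after[dev].busy_time - b.busy_time
            # busy_ms out of interval*1000 ms, expressed as a percentage
            util[dev] = min(100.0, busy_ms / (interval * 10.0))
    return util

if __name__ == "__main__":
    for dev, pct in disk_utilisation().items():
        print(f"{dev}: {pct:.1f}%" + ("  <-- saturated" if pct > 90 else ""))
```

Values persistently near 100% mean the disks, not the CPU, are the bottleneck, which matches the symptoms we saw on the replica server.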
By breaking down these technical details, we hope to provide a clearer understanding of what happened and how we addressed the issue.
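For the resynchronisation step at 15:00, progress can also be watched from the primary's side. Again assuming PostgreSQL, which the report does not confirm, pg_stat_replication exposes how far each standby still has to replay; the connection string below is hypothetical.

```python
# Hedged sketch: watching standbys catch up during resynchronisation,
# assuming PostgreSQL. Run it against the primary; the DSN is hypothetical.
import psycopg2

def replication_backlog_bytes(primary_dsn: str) -> dict[str, int]:
    """WAL bytes each connected standby still has to replay."""
    with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT application_name,"
            " pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)"
            " FROM pg_stat_replication"
        )
        return {name: int(diff or 0) for name, diff in cur.fetchall()}

if __name__ == "__main__":
    print(replication_backlog_bytes("host=primary.example.com dbname=app user=monitor"))
```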
1. Review and Enhance Backup Procedures: Evaluate our current backup and failover strategies to ensure quicker recovery in future incidents.
2. Optimise Resource Allocation: Ensure that VMs have adequate CPU and I/O resources to handle unexpected loads (see the sketch after this list).
3. Upgrade Backup Infrastructure: Plan to remove virtualisation from the replica database server and upgrade our backup infrastructure so it can handle full production traffic without performance degradation.
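As a starting point for recommendation 2, a simple guardrail is to alert when sustained CPU utilisation approaches its ceiling. The sketch below is illustrative only; the threshold and interval are arbitrary choices, not fireTMS settings.

```python
# Illustrative guardrail: flag sustained CPU saturation on a VM.
# The 90% threshold and 5-second window are arbitrary example values.
import psutil

def cpu_saturated(threshold: float = 90.0, interval: float = 5.0) -> bool:
    """True if average CPU utilisation over `interval` seconds exceeds `threshold` percent."""
    return psutil.cpu_percent(interval=interval) > threshold

if __name__ == "__main__":
    if cpu_saturated():
        print("sustained CPU saturation: add vCPUs or shift load")
```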
fireTMS Team
This article was written by the fireTMS team, drawing on their knowledge, experience, and familiarity with the TSL industry.