On the evening of July 11th, 2024, fireTMS experienced a significant outage due to an incident with our cloud provider OVH, which led to the unexpected shutdown of our main database. The main database is critical for our operations, handling all user data and transactions. This report outlines the timeline of events, our response, and the steps taken to restore full functionality to our services.
Glossary
Replica server – a copy of the main database server that stores the same data and is synchronised with the main server in real time. It is hosted in a different location from the main database.
Virtualisation – a technology that allows multiple virtual machines to run on a single physical server, allowing for better resource utilisation, flexibility and isolation of environments.
Worker server – a server that processes tasks in the background, handling various events and external operations and offloading the main application.
RAID – a technology that combines multiple disks into a single array, increasing performance and providing data protection against disk failure.
UI server – a server that manages user interactions with the application interface, processing requests and providing real-time responses.
Timeline of Events
July 11th, 2024
– 21:59:15: OVH unexpectedly powered off the fireTMS main database server.
July 12th, 2024
– 07:28: Employees discovered the database server was down and attempted to access it via the OVH panel IPMI protocol. The server status was “Powered Off.”
A quick search of the OVH status page showed that our server had an “Incident”, with only vague information available.
– 07:37: We switched to our replica database located in a different datacenter in another country.
– 07:57: UI servers were restarted after switching to the replica database.
– 08:12: Degraded application performance was noted due to insufficient CPU resources on the VM instance. The worker server was also underperforming, prompting us to plan a switch to a backup datacenter.
– 22:20: Successfully switched the worker server to the backup datacenter, which improved event consumption performance.
July 13th, 2024
– Decision: Opted to purchase two new UI servers in a backup datacenter with a 24-hour delivery time, as the delivery time for a similar server to our main database was over 72 hours.
– 19:08: OVH powered on our main database server.
July 14th, 2024
– 10:00: Began preparations to switch to new servers in a backup datacenter and checked the status of the main database server.
– 10:14: Discovered that RAID partitions were missing. Diagnostics confirmed the issue, leading to a support ticket for OVH to replace the disks.
– Procedure: We decided to wait for OVH to replace the faulty disks before resynchronising, as the server might go offline at any moment and interrupt synchronisation.
– 17:54: Redirected all traffic to the backup datacenter servers, and initial performance tests were positive.
July 15th, 2024
– 08:19: Noted degraded performance of fireTMS.
– 09:02: Server tweaks and restarts were performed to resolve performance issues, but disk I/O exhaustion was identified as the root cause.
– 13:26: Received an email from OVH scheduling disk replacement.
– 21:20: OVH started their work on disk replacement.
July 16th, 2024
– Morning: Our team noticed that the disk replacement had been completed and started work on rebuilding the main database. Because of limited read/write capacity, the team decided to postpone the work to 15:00.
– 15:00: Resynchronisation of the database began and continued until 21:22.
– 22:34: The first UI server in our main datacenter was brought online, and user traffic was directed to it.
– 21:55: All infrastructure was restored to the pre-incident state.
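To put the early part of the timeline in perspective, the gaps between the key events can be worked out directly from the timestamps above. The short Python sketch below is only an illustration: the event names and the elapsed helper are hypothetical labels introduced here, and all timestamps are assumed to be in the same local time zone, since the report does not state one.

```python
from datetime import datetime

# Timestamps copied from the timeline in this report. The dictionary keys are
# hypothetical labels, and all times are assumed to share one time zone.
EVENTS = {
    "shutdown": datetime(2024, 7, 11, 21, 59, 15),     # OVH powered off the main database
    "discovered": datetime(2024, 7, 12, 7, 28),        # employees found the server down
    "replica_switch": datetime(2024, 7, 12, 7, 37),    # switched to the replica database
    "ui_restart": datetime(2024, 7, 12, 7, 57),        # UI servers restarted
    "powered_on": datetime(2024, 7, 13, 19, 8),        # OVH powered the server back on
}


def elapsed(start: str, end: str) -> str:
    """Return the time between two named events as 'Hh MMm SSs'."""
    total = int((EVENTS[end] - EVENTS[start]).total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


if __name__ == "__main__":
    print("Shutdown to discovery:        ", elapsed("shutdown", "discovered"))
    print("Discovery to replica switch:  ", elapsed("discovered", "replica_switch"))
    print("Replica switch to UI restart: ", elapsed("replica_switch", "ui_restart"))
    print("Shutdown to server power-on:  ", elapsed("shutdown", "powered_on"))
```

On these figures, the shutdown went unnoticed for 9h 28m 45s, the switch to the replica followed 9 minutes after discovery, and the main database server stayed powered off for 45h 08m 45s.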
Summary
The incident began with an unexpected shutdown of our main database server by OVH, followed by a series of events that required quick decision-making and adaptation. Our response involved switching to a replica database, purchasing new servers, and waiting for OVH to replace faulty disks. The performance issues were mainly due to disk I/O exhaustion on the replica server.
By breaking down these technical details, we hope to provide a clearer understanding of what happened and how we addressed the issue.
Next Steps
1. Review and Enhance Backup Procedures: Evaluate our current backup and failover strategies to ensure quicker recovery in future incidents.
2. Optimise Resource Allocation: Ensure that VMs have adequate CPU and I/O resources to handle unexpected loads.
3. Upgrade Backup Infrastructure: Plan to remove virtualisation from the replica database and upgrade our backup infrastructure so it can handle full production traffic without performance degradation.
fireTMS Team
This article was written by the fireTMS team, based on its knowledge, experience and familiarity with the TSL industry.