We have verified that the cause of the outage was a database failure. One of the instances in our high-availability (HA) database setup failed silently, in a way the built-in monitoring did not immediately notice. The failed instance kept replicating but stopped responding to new queries and stopped reporting metrics. The monitoring should have detected the failure immediately and performed a quick fail-over to another instance; instead, its logic was apparently misled by the combination of working replication and absent metrics. After 15 minutes a timeout in the logic finally kicked in and triggered a fail-over.
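To illustrate the failure mode, here is a minimal, hypothetical sketch of a health check that would have treated this instance as failed. All names (`InstanceHealth`, `should_fail_over`, the thresholds) are our own illustration, not our vendor's actual monitoring API: the point is that "replication works" alone must not count as healthy when the instance neither answers queries nor reports metrics.

```python
# Hypothetical health snapshot for one database instance.
# Field names and thresholds are illustrative assumptions.
class InstanceHealth:
    def __init__(self, replication_ok, answers_queries, last_metrics_age_s):
        self.replication_ok = replication_ok          # replication stream alive
        self.answers_queries = answers_queries        # responds to a probe query
        self.last_metrics_age_s = last_metrics_age_s  # seconds since last metrics report

def should_fail_over(health, metrics_stale_after_s=30):
    """Treat an instance as failed if it stops answering queries or stops
    reporting metrics, even while replication still looks healthy."""
    if not health.answers_queries:
        return True
    if health.last_metrics_age_s > metrics_stale_after_s:
        return True
    return not health.replication_ok

# The failure mode we hit: replication fine, queries dead, metrics silent.
silent_failure = InstanceHealth(True, False, 900)
print(should_fail_over(silent_failure))  # → True
```

A check shaped like this fails over within one probe interval rather than waiting for a 15-minute timeout.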
We identify the late fail-over as the main issue in the outage: in an HA setup, single instances must be expected to fail from time to time. We should have seen 30-60 seconds of instability but ended up with 17-18 minutes of outright downtime. Unfortunately, our database vendor has not yet found a clear reason for the late fail-over. It is our understanding, however, that they are working on improving their systems' monitoring logic and that this incident is valuable input to that work.
During the downtime we saw that our applications did not handle prolonged database issues well. We need the applications to fail early in a controlled way rather than keep spawning new, hanging database connections. Adding a circuit breaker to the database layer is not a simple task that can be completed in the near future, but it is now on our roadmap. Because application threads remained hung throughout the 17-18 minute period instead of terminating early, clients experienced an unacceptable number of lost responses. We want to be better than that.
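The fail-early behaviour we are after can be sketched as a classic circuit breaker. This is a minimal illustration under our own assumptions (class and parameter names are ours, not from any specific library): after a number of consecutive failed database calls the breaker opens, and further calls fail immediately with an error instead of hanging on new connections; after a cool-down it lets one trial call through.

```python
import time

class CircuitOpenError(Exception):
    """Raised instead of attempting a call while the breaker is open."""

class CircuitBreaker:
    # Thresholds are illustrative; real values need tuning per service.
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                # Fail fast: no new connection attempt, no hanging thread.
                raise CircuitOpenError("database circuit open; failing fast")
            # Half-open: allow one trial call through after the cool-down.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the breaker fully
        return result
```

Wrapping each database query in `breaker.call(...)` would have turned 17-18 minutes of hung threads into fast, controlled errors that the application layer can translate into sensible responses.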