We are sorry about the inconvenience caused by this incident. Our investigation has shown that a busy database and unfortunate timing of events led to the incident. Below we explain what caused the incident, quantify the impact as seen from our side, describe how we mitigated the issue, and outline the improvements and actions we have already taken as well as those in the pipeline.
We have recurring jobs that clean up data; for instance, it is important to get rid of sensitive data such as card numbers. Such a job ran on Monday 2020-08-17, leading up to the incident. The amount of data cleaned up by these jobs increases over time, so the jobs have been made to run efficiently. Very unfortunately, the amount of data combined with the efficiency of the execution ended up depleting exactly our primary database's ability to handle bursts of traffic. The job ran a bit slower than usual towards the end because it was throttled; however, it also left the database without the ability to handle bursts.
The ability to burst is regained over time, but the transaction pattern that followed kept the burst balance down. Due to the database's inability to handle bursts, and due to transaction rules that are heavy on the database, we started seeing errors. Internally, we are in the process of cleaning up such heavy rules, which are in some cases superfluous; however, this task is not yet complete. (See action 1.)
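The burst-balance dynamics described above can be sketched as a simple token bucket: credits accrue at a baseline rate and are spent whenever I/O demand exceeds that baseline. The numbers and names below are illustrative assumptions for the sketch, not our database's actual parameters.

```python
# Illustrative token-bucket model of a database's I/O burst balance.
# All figures are hypothetical; real values depend on the storage type.

BASELINE_IOPS = 300      # credits earned per second (assumed baseline)
MAX_CREDITS = 1_000_000  # cap on accumulated burst credits (assumed)

def step(credits: float, demanded_iops: float, seconds: float = 1.0) -> float:
    """Advance the burst balance by `seconds` under a steady I/O demand."""
    # Credits accrue at the baseline rate and are spent at the demanded rate.
    delta = (BASELINE_IOPS - demanded_iops) * seconds
    return max(0.0, min(MAX_CREDITS, credits + delta))

# A heavy clean-up job depletes the balance ...
credits = MAX_CREDITS
credits = step(credits, demanded_iops=3_000, seconds=600)  # 10 min heavy job

# ... after which a transaction load at (or above) the baseline keeps the
# balance at zero, so the ability to burst is never regained.
credits = step(credits, demanded_iops=300, seconds=3600)
print(credits)  # → 0.0
```

The second call shows why the incident persisted after the job had finished: regaining burst capacity requires demand to drop *below* the baseline for a while, which the following transaction pattern did not allow.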
Below we cover only the three major transaction types. The analyzed period is 14:50-17:00 UTC.
Authorizations: The average authorization time increased tenfold, and during the worst period the average was 20 times higher than usual. 90% of the authorization requests were responded to within 20 seconds, and 74% within 10 seconds. Less than half a percent of the authorization requests were not responded to.
Captures: Captures were impacted the most. During the analyzed period captures frequently exceeded the maximum processing time of 60 seconds. The average capture processing time of approximately 200 ms was bumped to a whopping 17 seconds during the incident. 74% of the capture requests were responded to within 20 seconds, and 65% within 10 seconds. Approximately 18% of the capture requests were not responded to.
Refunds: The average processing time for a refund went from our usual 200 ms to approximately 4 seconds. 73% of the refund requests were responded to within 20 seconds, and 45% within 10 seconds. Approximately 4% of the refund requests were not responded to.
The unfortunate timing of the recurring clean-up job and the transaction pattern initially led us to believe that the issue lay elsewhere, namely with the heavy transaction rules. We therefore started identifying accounts whose heavy rules burdened the database and adjusted their rules. While this decreased both the number of timeouts and the average transaction processing time, it could not solve the underlying problem that the database was unable to handle bursts, and manually going through accounts and their rules simply did not scale. When this became clear to us, we started investigating better ways to avoid the throttling. Essentially, there was no upgrade path, because every upgrade path would put further pressure on the database for an extended period of time. We initiated the process of testing the least pessimistic-looking upgrade path, but decided to pursue alternatives in parallel. We did come up with another strategy that proved effective: a coarse-grained, automatic filtering of the transaction rules, which removed most burst-dependent queries from the database.
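To illustrate the mitigation, a coarse-grained filter can skip exactly those rules whose evaluation requires historical, database-backed queries. The rule representation and the `requires_history` flag below are assumptions made for this sketch, not our actual rule engine.

```python
# Hypothetical sketch of coarse-grained transaction-rule filtering.
# A rule that needs historical data triggers a database query; under
# degraded burst capacity we skip exactly those rules.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    requires_history: bool            # True if evaluation queries the database
    evaluate: Callable[[dict], bool]  # True if the transaction passes the rule

def filter_rules(rules: list[Rule], degraded: bool) -> list[Rule]:
    """In degraded mode, keep only rules that do not hit the database."""
    if not degraded:
        return rules
    return [r for r in rules if not r.requires_history]

rules = [
    Rule("valid-currency", False, lambda tx: tx["currency"] in {"DKK", "EUR"}),
    Rule("velocity-check", True, lambda tx: True),  # would query history
]

active = filter_rules(rules, degraded=True)
print([r.name for r in active])  # → ['valid-currency']
```

Because the filter is coarse-grained and automatic, it applies across all accounts at once, which is what made it scale where the manual, per-account rule adjustments did not.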
When we saw that the issue had effectively been mitigated and that the ability to burst was being regained, we planned and executed a rollback of the transaction rule filter in a controlled manner, bringing the systems back to normal.
We always strive to be better and to learn from our own and others' mistakes. We have already learned quite a bit from this incident and have initiated multiple improvements to our organisation, incident handling procedures, and IT systems. Some actions need to be taken immediately, and some will be taken in the near future after further planning and investigation.
If you have follow-up questions, please reach out to us by email (support@clearhaus.com) or phone (+45 8282 2200).
Sincerely,
Casper Thomsen, Operations