Transaction gateway: Increased failure rate and processing times
Incident Report for Clearhaus
Postmortem

We are sorry about the inconvenience caused by this incident. Investigation has shown that a busy database and unfortunate timing of events led to the incident. Below we explain what caused the incident, quantify the impact as seen from our side, describe how we mitigated the issue, and outline the improvements and actions we have already taken as well as those in the pipeline.

The root cause

We have recurring jobs that clean up data; for instance, it is important to get rid of sensitive data such as card numbers. One of these jobs ran on Monday 2020-08-17, leading up to the incident. The amount of data cleaned up by these jobs grows over time, so the jobs have been made to run efficiently. Unfortunately, the amount of data combined with the efficiency of the execution ended up depleting our primary database’s capacity to handle bursts of traffic. Towards the end the job ran a bit slower than usual because it was throttled, but it still left the database unable to handle bursts.

The ability to burst is regained over time, but the transaction pattern that followed kept the burst balance down. Because the database could not handle bursts, and because some transaction rules are heavy on the database, we started seeing errors. Internally we are in the process of cleaning up such heavy rules, which are in some cases superfluous, but that task is not yet complete. (See immediate action 1.)

Quantification

We cover only the three major transaction types. The analyzed period is 14:50-17:00 UTC.

Authorizations: The average authorization time increased tenfold, and during the worst period the average was 20 times higher than usual. 90% of the authorization requests were responded to within 20 seconds, and 74% within 10 seconds. Less than half a percent of the authorization requests were not responded to.

Captures: Captures were impacted the most. During the analyzed period captures frequently exceeded the maximum processing time of 60 seconds. The average capture processing time of approximately 200 ms was bumped to a whopping 17 seconds during the incident. 74% of the capture requests were responded to within 20 seconds, and 65% within 10 seconds. Approximately 18% of the capture requests were not responded to.

Refunds: The average processing time for a refund went from our usual 200 ms to approximately 4 seconds. 73% of the refund requests were responded to within 20 seconds, and 45% within 10 seconds. Approximately 4% of the refund requests were not responded to.

Mitigation

The unfortunate timing of the recurring clean-up job and the transaction pattern initially led us to believe that the issue lay elsewhere, namely with the heavy transaction rules. We therefore started identifying accounts whose heavy rules burdened the database and adjusted their rules. While this decreased both the number of timeouts and the average transaction processing time, it could not solve the underlying problem that the database was unable to handle bursts, and manually going through accounts and their rules simply did not scale.

When this became clear to us, we started investigating better ways to avoid the throttling. Essentially there was no straightforward upgrade path, because every upgrade path would put further pressure on the database for an extended period of time. We initiated testing of the least pessimistic-looking upgrade path, but decided to pursue alternatives in parallel. One of these alternatives proved effective: a coarse-grained, automatic filtering of the transaction rules, which removed most burst-dependent queries from the database.
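As an illustration, below is a minimal sketch (in Python) of the filtering idea, assuming a simplified rule model in which each rule is flagged by whether its evaluation requires heavy, burst-dependent database queries; the names and structure are illustrative assumptions and not our actual rule engine.

    # Minimal sketch of coarse-grained rule filtering. The rule model and the
    # "requires_heavy_query" flag are illustrative assumptions only.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class TransactionRule:
        name: str
        requires_heavy_query: bool        # e.g. aggregates over historical transactions
        evaluate: Callable[[Dict], bool]  # returns True if the transaction passes the rule

    def applicable_rules(rules: List[TransactionRule], database_degraded: bool) -> List[TransactionRule]:
        # Coarse-grained filter: while the database has no burst capacity left,
        # keep only the lightweight rules so transaction processing stays within
        # the database's sustained (non-burst) capacity.
        if not database_degraded:
            return rules
        return [rule for rule in rules if not rule.requires_heavy_query]

The key property of this approach is that it is coarse-grained: it does not require manually judging individual accounts and their rules, which is exactly what did not scale during the incident.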

When we saw that we had effectively mitigated the issue and saw the ability to burst being regained, we planned and executed a rollback of the transaction rule filter in a controlled manner, thereby bringing the systems back to normal.

Improvements

We always strive to be better and to learn from our own and others’ mistakes. We have already learned quite a bit from this incident and have initiated multiple improvements to our organisation, incident handling procedures, and IT systems. Some actions need to be taken immediately, and some will be taken in the near future after further planning and investigation.

Immediate actions

  1. Adjust the process for cleaning up unnecessary heavy rules in order to speed it up. The clean-up is expected to be completed within a few weeks; we are currently adjusting our tooling to accommodate a faster process. Which rules are necessary is a risk decision, so unfortunately the clean-up is not easily automated for all accounts.
  2. Extend our alerting to notify the on-duty team when the ability to burst is being drawn down heavily and when it has crossed an unacceptable threshold (see the sketch after this list). Adjust the clean-up process for e.g. sensitive data to ensure that it does not impact our ability to process transactions.
  3. Emphasize in our incident handling procedure that an early announcement shall be considered. When the incident happened, all focus was put on mitigation; however, a swift announcement confirming that we had noticed the incident and were fully focused on mitigating it would have helped you, our partners and customers.
  4. Improve our insight into specifically how our databases are keeping up with increasing workloads. Our current metrics and insights proved to be insufficient or too coarse.
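As a simple illustration of immediate action 2, the sketch below (in Python) shows the kind of two-level check we have in mind: warn when the burst balance is being drawn down heavily, and alert the on-duty team when it crosses an unacceptable threshold. The thresholds, the metric source, and the notification hook are placeholder assumptions; they do not describe our actual monitoring stack.

    # Illustrative burst-balance check; thresholds and the notify callback are placeholders.
    WARN_THRESHOLD = 60.0      # percent remaining: burst capacity is being drawn down heavily
    CRITICAL_THRESHOLD = 30.0  # percent remaining: unacceptably low

    def check_burst_balance(percent_remaining: float, notify) -> None:
        # Compare the latest burst-balance reading (0-100 %) against the thresholds
        # and notify the on-duty team accordingly.
        if percent_remaining <= CRITICAL_THRESHOLD:
            notify(f"CRITICAL: database burst balance at {percent_remaining:.0f}%")
        elif percent_remaining <= WARN_THRESHOLD:
            notify(f"WARNING: database burst balance at {percent_remaining:.0f}% and falling")

    # Example: check_burst_balance(25.0, notify=print) would trigger the critical alert.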

Improvements in the future

  1. Increase various properties of our databases, with the intention of being able to handle larger bursts.
  2. Separate the transaction rule handling from the transaction processing.

Questions

If you have follow-up questions, please reach out to us by email (support@clearhaus.com) or phone (+45 8282 2200).

Sincerely,

Casper Thomsen, Operations

Posted Aug 20, 2020 - 14:13 UTC

Resolved
The rollback went fine and systems are now fully back to normal. We expect no further incident updates.


Guidance for our partners, added 2020-08-18

Since Clearhaus systems only hold our view of the transaction state, we encourage our partners to retrieve a list of transactions from either the Merchant API or the Dashboard to ensure consistency. The period 14:50-17:00 UTC on 2020-08-17 covers all potential inconsistencies caused by this incident. Captures and refunds are the primarily affected transaction types.

If inconsistencies are found, we recommend establishing consistency and encouraging the affected merchants to review their transactions for the period.
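For partners who prefer to reconcile programmatically, the sketch below (in Python) shows the general idea: compare the captures and refunds Clearhaus has on record for the incident window against your own records and flag the ids that appear on only one side. The transaction fields and the way the Clearhaus list is obtained are illustrative assumptions; use the Merchant API or a Dashboard export as the source of our view.

    # Illustrative reconciliation of captures/refunds for the incident window.
    # Both inputs are lists of dicts with "id", "type" and a parsed "created_at"
    # datetime; these field names are assumptions for the sketch.
    from datetime import datetime, timezone
    from typing import Dict, List

    WINDOW_START = datetime(2020, 8, 17, 14, 50, tzinfo=timezone.utc)
    WINDOW_END = datetime(2020, 8, 17, 17, 0, tzinfo=timezone.utc)

    def reconcile(clearhaus_txns: List[Dict], local_txns: List[Dict]) -> List[str]:
        # Return the ids of captures/refunds recorded on only one side.
        interesting = {"capture", "refund"}  # the primarily affected transaction types

        def ids(txns: List[Dict]) -> set:
            return {t["id"] for t in txns
                    if t["type"] in interesting
                    and WINDOW_START <= t["created_at"] <= WINDOW_END}

        # Symmetric difference: present in one system but not in the other.
        return sorted(ids(clearhaus_txns) ^ ids(local_txns))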
Posted Aug 17, 2020 - 23:04 UTC
Update
Firstly, we can confirm that the Merchant API and the Gateway API were consistent with the exception of exactly one authorisation; this has been re-sent, and the PSP in question has been informed.

The mitigation that we put in place, announced here at 15:27 UTC, has proven effective. This temporary mitigation is being rolled back in a careful manner over the coming hours while we closely monitor the systems.
Posted Aug 17, 2020 - 20:19 UTC
Update
Our monitoring shows that our current mitigation has been effective. Processing times and error rates should be back to normal.

If you have experienced timeouts for capture or refund processing and want to resolve them quickly, we have the following guidance: if the transaction is for the full amount, it can safely be retried.
If the capture was already processed for a given timed-out transaction, the retry will receive HTTP 400 with status code 40000 and the status message “no remaining amount”. Likewise for refunds.
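As a non-authoritative illustration of this guidance, the sketch below (in Python) retries a full-amount capture and treats HTTP 400 with status code 40000 (“no remaining amount”) as confirmation that the original, timed-out request did go through. The endpoint path, authentication, and exact response shape are assumptions for the sketch; please consult the Gateway API documentation for the authoritative details.

    # Illustrative retry of a timed-out full-amount capture. Endpoint path, auth
    # and response shape are assumptions; only the 400/40000 interpretation is
    # taken from the guidance above.
    import requests

    GATEWAY = "https://gateway.clearhaus.com"

    def retry_full_capture(api_key: str, authorization_id: str, amount: int, currency: str) -> str:
        url = f"{GATEWAY}/authorizations/{authorization_id}/captures"  # illustrative path
        resp = requests.post(url, auth=(api_key, ""),
                             data={"amount": amount, "currency": currency}, timeout=60)
        if resp.ok:
            return "captured"  # the retry went through
        status = resp.json().get("status", {})
        if resp.status_code == 400 and status.get("code") == 40000:
            # "no remaining amount": the original, timed-out capture was in fact processed.
            return "already captured"
        resp.raise_for_status()  # anything else is a genuine error

The same interpretation applies to refunds that time out and are retried for the full amount.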
At this time we are not 100 % sure whether the Merchant API is consistent with the Gateway API. We will investigate this and correct the data behind the Merchant API in case of any inconsistencies. Please await an update on this within a few hours.
Posted Aug 17, 2020 - 17:27 UTC
Monitoring
Since approximately 14:50 UTC we have experienced increased processing times and increased failure rates; however, we believe the issue is now resolved.
We will provide a postmortem when a thorough investigation has taken place.
Posted Aug 17, 2020 - 17:01 UTC
This incident affected: Payment Processing APIs (Gateway API (gateway.clearhaus.com)).