As a follow-up to the incident on 2024-09-12, these are our findings from analyzing the sequence of events. Again, we sincerely apologize for the disruption, downtime and inconvenience.
During the timespan 2024-09-12T03:30Z - 2024-09-12T11:45Z, we experienced connection failures and timeouts to an upstream provider. This affected live transactions: authorizations and voids for both Visa and Mastercard as well as credits for Mastercard. After confirming that the issue was entirely external, and that we were thus unable to mitigate it ourselves, we reached out to our provider, opened an issue with them and escalated immediately.
Our automated failover moved traffic to the provider’s secondary data center when the first transactions towards their primary data center failed. While this is normally sufficient to mitigate small hiccups, this incident showed that the secondary data center was also affected. No transactions towards the secondary data center were approved, so we decided to target only the primary data center, where we saw upwards of 10-15 % of the transactions going through. We periodically sent small bursts of transactions to the secondary data center to monitor its availability. There was a small window of 10 minutes in which transactions were processed through the secondary data center before it stopped working again. We contacted our provider to inform them that the secondary data center had briefly been in a working state.
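For illustration only, the routing behavior described above can be sketched roughly as follows. This is not our production code: the names FailoverRouter, send, probe_every and pin_to_primary are hypothetical, the probe interval is an arbitrary value, and pinning to the primary data center is modeled as a manual operator decision.

```python
from typing import Callable, List

PRIMARY, SECONDARY = "primary", "secondary"


class FailoverRouter:
    """Hypothetical sketch of failover plus periodic probing (assumed design)."""

    def __init__(self, send: Callable[[str, str], bool], probe_every: int = 20):
        # send(data_center, txn) returns True if the transaction was processed.
        self.send = send
        self.probe_every = probe_every       # assumed: 1 in N transactions is a probe
        self.active = PRIMARY                # data center receiving normal traffic
        self.pinned_to_primary = False       # set by an operator, not automatically
        self.probe_results: List[bool] = []  # outcomes of probes to the secondary
        self._count = 0

    def pin_to_primary(self) -> None:
        """Operator decision: target only the primary, keep probing the secondary."""
        self.pinned_to_primary = True
        self.active = PRIMARY

    def submit(self, txn: str) -> bool:
        self._count += 1
        if self.pinned_to_primary:
            if self._count % self.probe_every == 0:
                # Occasional probe to monitor the secondary's availability.
                ok = self.send(SECONDARY, txn)
                self.probe_results.append(ok)
                return ok
            return self.send(PRIMARY, txn)
        ok = self.send(self.active, txn)
        if not ok and self.active == PRIMARY:
            # Automated failover: later traffic goes to the secondary.
            self.active = SECONDARY
        return ok
```

In this sketch a failed transaction is not retried against the other data center; only subsequent traffic is moved. Whether a retry is safe depends on the transaction type and is outside the scope of the example.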
After approximately 8 hours, our provider was able to mitigate the issue in both data centers, which resolved the incident, and approval rates went back to normal.
For a short period of time during the incident, our own services became exhausted, which exacerbated the condition. This resulted in the transaction gateway responding with HTTP 504 Gateway Timeout, which affected all transaction types. We are sincerely sorry for the inconvenience and we are working to prevent this from happening in the future.
While we had no direct involvement in the incident at our provider, we have learned a few things to take with us:
Already implemented improvements: