Connectivity issue towards upstream
Incident Report for Clearhaus
Postmortem

Introduction

As a follow-up to the incident on 2024-09-12, these are our findings from analyzing the sequence of events. Again, we sincerely apologize for the disruption, downtime and inconvenience.

The issue

During the timespan 2024-09-12T03:30Z - 2024-09-12T11:45Z, we experienced connection failures and timeouts to an upstream provider. This affected live transactions: authorizations and voids for both Visa and Mastercard as well as credits for Mastercard. After confirming that the issue was entirely external, and that we were thus unable to mitigate it ourselves, we reached out to our provider, opened an issue with them and escalated immediately.

Our automated failover moved traffic to the provider’s secondary data center when the first transactions towards their primary data center failed. While this is normally sufficient to mitigate small hiccups, in this case the secondary data center was also affected by the incident. No transactions towards the secondary data center were approved, so we decided to target only the primary data center, where roughly 10-15 % of the transactions were going through. We periodically sent small bursts of transactions to the secondary data center to monitor its availability. There was a short window of about 10 minutes in which transactions were processed through the secondary data center before it stopped working again. We contacted our provider to inform them that the secondary data center had briefly been in a working state.
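
For illustration only, the manual probing described above could in principle be automated along the following lines. This is a minimal sketch and not a description of our actual failover mechanism: the class `DataCenterHealth`, the function `choose_data_center`, the window size and the probe fraction are all hypothetical names and values chosen for the example.

```python
import random
from collections import deque


class DataCenterHealth:
    """Rolling success rate over the most recent `window` outcomes (hypothetical helper)."""

    def __init__(self, window=50):
        # Old outcomes fall out automatically once the window is full.
        self.outcomes = deque(maxlen=window)

    def record(self, success):
        self.outcomes.append(bool(success))

    def success_rate(self):
        # With no observations yet, assume healthy so traffic can start flowing.
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)


def choose_data_center(health, probe_fraction=0.02, rng=random.random):
    """Pick a data center name for the next transaction.

    Most traffic goes to the data center with the best rolling success rate.
    A small fraction (an assumed 2 % by default) is sent to another data
    center as a probe, so that a recovery is noticed without manual bursts.
    """
    best = max(health, key=lambda name: health[name].success_rate())
    others = [name for name in health if name != best]
    if others and rng() < probe_fraction:
        return random.choice(others)
    return best
```

The caller would record the outcome of every transaction with `record(...)` on the chosen data center, so the probe results feed back into the next routing decision.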

After approximately 8 hours, our provider was able to mitigate the issue in both data centers, which resolved the incident, and approval rates went back to normal.

For a short period during the incident, our own services suffered resource exhaustion, which exacerbated the situation. This resulted in the transaction gateway responding with HTTP 504 Gateway Timeout, which affected all transaction types. We are sincerely sorry for the inconvenience and we are working to prevent this from happening in the future.

Remediation

While we do not have any direct involvement in the incident at our provider, we have learned a thing or two to take with us:

  1. We will investigate the possibility of voiding authorizations that received status code 50000, so that clients can void these transactions and ensure that the state upstream matches the state at the client. At the same time, we want to highlight that the transaction representation in the Merchant API helps in understanding the state of a transaction. For critical transactions (e.g. credits or “large enough” authorizations), we recommend using the Merchant API to check the state of a transaction whenever you did not receive a response (no matter if the failure happened on our end, somewhere in the middle, or on your end).
  2. To avoid HTTP 504 responses, we will go over our resource allocation and adjust it for better tolerance. In addition, we will improve alerting on our application load balancer to highlight when our internal systems are responding slowly, which indicates resource exhaustion in our system.
  3. We will investigate whether we can improve the automated failover mechanism to balance traffic so that we do not need to manually send small bursts to test availability. This could potentially have made us aware earlier that the secondary data center had come back up, both midway through the incident and at the very end of it.
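
As an illustration of the recommendation in item 1, a client could check the recorded state of a transaction instead of blindly retrying when no response arrives. The following is a minimal sketch under stated assumptions: `send` and `lookup` are hypothetical callables supplied by the integrator (they are not Clearhaus API functions), and the `approved` field is an assumed shape for the example, not the actual transaction representation.

```python
from typing import Callable, Optional


def authorize_with_reconciliation(
    send: Callable[[], dict],
    lookup: Callable[[], Optional[dict]],
) -> str:
    """Return 'approved', 'declined' or 'unknown' for one authorization attempt.

    `send` (hypothetical) submits the authorization and raises TimeoutError
    when no response arrives. `lookup` (hypothetical) asks the Merchant API
    for the recorded state of that transaction and returns None if no record
    is found.
    """
    try:
        result = send()
    except TimeoutError:
        # No response: the state is uncertain, so look it up before retrying,
        # since the amount may already have been reserved upstream.
        result = lookup()
        if result is None:
            return "unknown"
    # Assumed response shape: a dict with a boolean 'approved' field.
    return "approved" if result.get("approved") else "declined"
```

An `unknown` result is the case to handle with care, for example by looking the transaction up again later rather than sending a new authorization right away.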

Already implemented improvements:

  1. Extended internal dashboards to give improved visibility into the continuous success rates of the individual upstream data centers.
Posted Oct 02, 2024 - 12:39 UTC

Resolved
Our upstream partner has resolved their issue and our monitoring shows that the connection has been stable and without problems since our last update. We will investigate mitigating actions and the root cause of this incident and will publish a postmortem once we have a complete overview from the upstream partner.

To summarize, on average around 90 % of authorizations were impacted during the period 2024-09-12T03:30 - 2024-09-12T11:45 UTC. All impacted authorizations received status code 50000 from our gateway API for transaction processing.

We sincerely regret the inconvenience this serious incident has caused!
Posted Sep 12, 2024 - 20:39 UTC
Monitoring
We have seen a significant increase in approval rates and the connectivity is now back to normal. Due to the severity of this issue and the previous sudden change in approval rate, we are still actively monitoring the situation.
Posted Sep 12, 2024 - 12:04 UTC
Update
We are again seeing an improvement in approval rate starting 11:46 UTC and are continuing to monitor the situation, as the improvement we saw previously turned out to be temporary.
Posted Sep 12, 2024 - 11:51 UTC
Update
We are awaiting further communication from our provider. We are actively monitoring the situation. There is no ETA for when this issue will be resolved. We will update as soon as we know more.
Posted Sep 12, 2024 - 10:47 UTC
Update
We saw an improvement in approval rates for a short period of time from around 08:47 UTC to 08:59 UTC. Unfortunately, the situation has worsened again, and we are back to approval rates below 15 %.
Posted Sep 12, 2024 - 09:06 UTC
Update
Unfortunately, we still see very high failure rates on authorizations. Our upstream provider is still trying to mitigate this critical incident.
We are seeing examples where amounts are being reserved on cardholders' accounts despite the authorization failing. Therefore, we recommend that subsequent-in-series recurring and unscheduled transactions (e.g. subscriptions) are not attempted until this incident is resolved. Furthermore, we are exploring options to potentially release these reservations.
Posted Sep 12, 2024 - 08:41 UTC
Identified
We have identified that the issue is with our upstream provider. We have attempted mitigating actions but unfortunately with very limited impact. We are still seeing very low approval rates that fluctuate between approximately 0 % and 15 %. Our upstream provider is trying to resolve the connectivity issue.
Posted Sep 12, 2024 - 06:34 UTC
Update
Our upstream provider has confirmed that they are aware of a connectivity issue. Currently almost all authorizations, voids and Mastercard credits are impacted.
Posted Sep 12, 2024 - 05:05 UTC
Investigating
Since 2024-09-12T03:39Z we have observed connectivity issues towards an upstream provider. We have reached out and are investigating with the provider.
Posted Sep 12, 2024 - 04:42 UTC
This incident affected: Payment Processing APIs (Gateway API (gateway.clearhaus.com)).