Transaction gateway: Authorizations failing
Incident Report for Clearhaus
Postmortem

I am sorry for the transactions your customers could not complete during the periods when our authorization service was not functional. We care a lot about delivering a stable service, and strive to learn from failures like this.

We have investigated the incidents of November 15th together with our partner and would like to let you know what happened and what will be done to prevent similar failures.

Root cause

A missing database commit in one of our partner’s databases meant that, once all pooled database connections were in use, no more rows could be added and authorizations failed. After a manual failover, the authorization system was functional again. However, the same root cause was present on the backup system; this caused the second incident, which was resolved by restarting the primary system and failing back to it.

Later, during the investigation of the incidents, the authorization system was noticed to be under heavier load than usual. Investigating this led to the identification of the root cause, the missing database commit, which was then fixed.
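
To illustrate this class of bug, here is a minimal sketch of a pattern that guards against it. It is not our partner's code, and the exact mechanism in their system is not described in this report; the `ConnectionPool` class, its `transaction` helper and the use of SQLite are assumptions made purely for illustration. The idea is that every connection taken from the pool is always committed or rolled back, and always handed back, so a forgotten commit cannot quietly tie up pooled connections.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool (hypothetical; for illustration only)."""

    def __init__(self, database, size):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def transaction(self, timeout=1.0):
        # Raises queue.Empty if no connection becomes free within `timeout`,
        # which is how an exhausted pool would surface in this sketch.
        conn = self._free.get(timeout=timeout)
        try:
            yield conn
            conn.commit()        # the step that must never be skipped
        except Exception:
            conn.rollback()      # do not leave a transaction open on failure
            raise
        finally:
            self._free.put(conn)  # always hand the connection back
```

Callers would then write `with pool.transaction() as conn: conn.execute(...)` and never commit by hand, which removes the opportunity to forget it.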

Future prevention from our partner

Our partner has temporarily put an extra review step in place for changes that could have similar consequences. A long-term solution has been planned, which involves improving the tests that are run for such database changes.

Improvements by us

It is possible for us to fail over to a completely separate system. This is usually done by our service partner, but can be done by us. Failing over does have some undesirable consequences, but we have reconsidered this trade-off and will actively fail over to the separate system. In the short term, this will be done manually by our 24/7 on-call team; in the long term, we will work on an automated failover mechanism.
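
As a rough sketch of what automated failover can look like in general (this is not a description of our actual mechanism; the `FailoverRouter` class, the failure threshold and the use of `ConnectionError` are hypothetical choices for illustration), traffic is sent to the primary system until a number of consecutive failures is observed, after which it is routed to the separate system:

```python
from typing import Callable

Handler = Callable[[str], str]


class FailoverRouter:
    """Route requests to a primary handler; switch to a secondary
    after `max_failures` consecutive connection errors (illustrative)."""

    def __init__(self, primary: Handler, secondary: Handler,
                 max_failures: int = 3) -> None:
        self.primary = primary
        self.secondary = secondary
        self.max_failures = max_failures
        self.consecutive_failures = 0
        self.failed_over = False

    def send(self, request: str) -> str:
        if self.failed_over:
            return self.secondary(request)
        try:
            response = self.primary(request)
        except ConnectionError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_failures:
                self.failed_over = True  # later requests use the secondary
            raise
        self.consecutive_failures = 0  # any success resets the counter
        return response
```

In this sketch, failing back to the primary is deliberately left as a manual step (resetting `failed_over`), since failing over has consequences of its own.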

We noticed that our monitoring and alerting are functioning quite well, but we would like to update the status page faster to let you know about incidents.

Questions

If you have follow-up questions, please reach out by email (support@clearhaus.com), Twitter (https://twitter.com/clearhaus) or even by phone (+45 8282 2200).

Sincerely,

Casper Thomsen, Operations

Update 2016-12-08

The long-term automated failover mechanism has been in production for quite some time now, namely since February 25th. It has actually helped us save transactions a few times, improving our service to the benefit of our merchants. Also, the undesirable consequences have been limited, so the trade-off seems to be right.

Posted Nov 23, 2015 - 15:23 UTC

Resolved
The issue has been resolved by our service partner. We are monitoring this closely for the next hour.
Authorization traffic was failing between 15:24:10 and 15:56:42 UTC.
Posted Nov 15, 2015 - 16:06 UTC
Identified
Authorization traffic leaving Clearhaus towards card schemes is not being forwarded by our service partner, who has acknowledged the problem and is working on a solution.
This incident does not affect capture and refund traffic.
Posted Nov 15, 2015 - 15:44 UTC