Transaction gateway: Unavailability
Incident Report for Clearhaus
Postmortem

On Friday, March 23, 2018, we had an incident during a routine deployment to our production environment, which led to a downtime of approximately 35 minutes. We deeply regret this and offer our sincere apologies. Our goal is to provide a stable and solid service, something we believe we have achieved, so an incident like that of last Friday hurts our professional pride.

After a thorough investigation we know exactly what caused the incident and what we can do to prevent similar incidents in the future. Besides leading to a change in our deployment procedures, the investigation also produced some details about the incident that we would like to share with you.

Incident Details

We have three environments: testing (for testing; we break it once in a while), staging (meant to be just like production, but without production data; this is what partners integrate against, gateway.test.clearhaus.com), and production. Changes are first deployed to testing, then to staging, then to production. After updating testing and staging and moving on to production, we encountered an issue: the security update included in the deployment had caused one of the new servers to fail during startup. This had not been detected in either testing or staging. By itself this caused no disruption, as we start new servers separately from the servers processing actual transactions and test them before adding them to the load-balancer that distributes requests.
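To make that gating concrete, here is a minimal Python sketch of the idea: freshly started servers are health-checked and only attached to the load-balancer if they pass. The addresses, the /health endpoint and the attach_to_load_balancer function are illustrative assumptions, not a description of our actual tooling.

    import urllib.request

    # Illustrative addresses and health endpoint; not taken from the real setup.
    NEW_SERVERS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
    HEALTH_PATH = "/health"

    def is_healthy(host: str, timeout: float = 5.0) -> bool:
        """Return True if the freshly booted server answers its health check."""
        try:
            with urllib.request.urlopen(f"http://{host}{HEALTH_PATH}", timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def attach_to_load_balancer(host: str) -> None:
        """Placeholder for the call that registers a server with the load-balancer."""
        print(f"attaching {host} to the load-balancer")

    # Only servers that pass the check start receiving traffic; a server that
    # fails during startup (as in this incident) never goes behind the load-balancer.
    for host in NEW_SERVERS:
        if is_healthy(host):
            attach_to_load_balancer(host)
        else:
            print(f"{host} failed its health check; keeping it out of rotation")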

The issue was easy to resolve, and it was decided to fix it as part of this deployment and move the fix directly into production. We believed we had fixed the issue and started the deployment procedure anew. At this point we had 3 functioning old servers and 3 non-functioning new servers, the latter not behind the load-balancer. While updating the environment configuration, the number of running servers was automatically scaled back to 3, and it was the functioning servers that were terminated. This was a human mistake: the non-functioning new servers should have been discarded before continuing the deployment (a partial roll-back).
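To illustrate what went wrong, the simplified model below shows how scaling back to a fixed desired count can terminate the wrong servers when failed new instances have not been discarded first. The oldest-first termination policy and the server names are assumptions that merely reproduce the outcome described above; they are not the real tooling.

    from dataclasses import dataclass

    @dataclass
    class Server:
        name: str
        healthy: bool

    # The state at the time of the mistake: three functioning old servers and
    # three non-functioning new ones (names are illustrative).
    fleet = [
        Server("old-1", True), Server("old-2", True), Server("old-3", True),
        Server("new-1", False), Server("new-2", False), Server("new-3", False),
    ]

    DESIRED_COUNT = 3

    def naive_scale_down(servers, desired):
        """Terminate the oldest servers first, ignoring health (simplified model)."""
        cut = len(servers) - desired
        return servers[cut:], servers[:cut]

    def safe_scale_down(servers, desired):
        """Discard non-functioning servers first (the partial roll-back that
        should have happened), then scale down among the healthy ones."""
        healthy = [s for s in servers if s.healthy]
        broken = [s for s in servers if not s.healthy]
        return healthy[:desired], broken + healthy[desired:]

    keep, _ = naive_scale_down(fleet, DESIRED_COUNT)
    print("naive keeps:", [s.name for s in keep])  # only the broken new servers
    keep, _ = safe_scale_down(fleet, DESIRED_COUNT)
    print("safe keeps: ", [s.name for s in keep])  # the functioning old servers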

We received alerts almost immediately and quickly diagnosed the issue, deciding that it was best to wait for the new machines we had just started to finish booting. Sadly, the fix we thought we had in place did not solve the startup problem. On its own a failed fix would not normally cause an issue, but with the functioning old servers already terminated there was nothing left to serve traffic.

When this was realized, it was decided to roll back to the previous functioning version. Terminated servers cannot be recreated; new instances must be launched. The environment was therefore adjusted to boot new instances with the old, functioning version of the software.
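As a rough illustration, assuming a declarative environment definition that the tooling reads, the rollback amounts to pointing that definition back at the last known-good version and launching fresh instances. The class, field names and version identifiers below are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class EnvironmentConfig:
        """Minimal stand-in for the environment definition the tooling reads."""
        software_version: str
        desired_servers: int

    def roll_back(env: EnvironmentConfig, last_known_good: str) -> EnvironmentConfig:
        """Terminated servers cannot be resurrected, so rolling back means
        pointing the environment at the previous version and launching anew."""
        return EnvironmentConfig(software_version=last_known_good,
                                 desired_servers=env.desired_servers)

    def launch_instance(version: str) -> None:
        print(f"launching a new instance running {version}")

    # Hypothetical version identifiers, for illustration only.
    env = EnvironmentConfig(software_version="release-with-security-update",
                            desired_servers=3)
    env = roll_back(env, last_known_good="previous-functioning-release")
    for _ in range(env.desired_servers):
        launch_instance(env.software_version)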

Lessons learned

There are several lessons to be learned from this incident.

  • The original issue was not caught in either testing or staging, although it could have been.
  • Under certain circumstances, the tooling will automatically terminate servers when the software stack is updated.
  • The rollback procedure took far too long; practicing rollbacks could shorten the rollback time.

As a consequence, we have implemented new operating procedures for deployments, greatly reducing the chance that an event like this could occur in the future. This includes removing the possibility for our tooling to scale down servers running in production and establishing clear procedures for how to proceed when failures are detected during deployment.
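As an illustration only (the guard below is a simplified sketch, not our actual tooling), one way to remove the tooling's ability to scale down production is to let the desired server count grow but never shrink without explicit human action.

    class ScaleDownForbidden(Exception):
        """Raised when the tooling is asked to reduce capacity in production."""

    def apply_desired_count(environment: str, current: int, desired: int) -> int:
        """Hypothetical guard: in production the desired server count may grow,
        but an automatic scale-down is refused."""
        if environment == "production" and desired < current:
            raise ScaleDownForbidden(
                f"refusing to scale production from {current} to {desired} servers; "
                "instances must be discarded explicitly by a human"
            )
        return desired

    # Growing the fleet is still allowed ...
    apply_desired_count("production", current=3, desired=6)

    # ... but an automatic scale-down now fails loudly instead of terminating
    # functioning servers.
    try:
        apply_desired_count("production", current=6, desired=3)
    except ScaleDownForbidden as err:
        print(err)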

Posted Apr 04, 2018 - 09:53 UTC

Resolved
A change caused the gateway to stop accepting incoming connections. The service was partially unavailable from 16:04:14 UTC to 16:39:00 UTC.

During this time window the gateway either did not respond or answered authorizations with an error, while captures succeeded.


The cause of the incident is clear, and countermeasures and new procedures will be put in place to prevent incidents like this in the future.
Posted Mar 23, 2018 - 17:28 UTC