Post Mortem - April 8 2024 Dyspatch Outage

Intro

Dyspatch was unavailable from approximately 12:30 PM on April 8, 2024 until 1:00 AM on April 9, Pacific time, due to an issue that occurred during a routine upgrade of Dyspatch's infrastructure. This post mortem analyzes the root causes of the outage, assesses its impact on our services, and outlines the steps Dyspatch is taking to prevent similar incidents in the future.

Timeline (Pacific Time)

11:35 - We begin the upgrade
12:10 - The production cluster intermittently returns 503s for users. Dyspatch's services cannot communicate with each other.
12:17 - We attempt to roll back the changes.
12:30 - We identify the problem: the internal authentication mechanism our services use to communicate securely is out of sync across services.
12:30 - 17:30 - We try several strategies to bring production online.
17:30 - To avoid further impact to our production environment, work begins on our staging environment.
18:17 - We identify that previous changes were made to our staging environment without being applied to our production environment.
21:16 - Staging is online. We begin applying the changes from our staging environment to our production environment.
00:56 - Dyspatch is available again.

Why did this happen? What did we learn?

During the outage we ran into several challenges while trying to restore service. We quickly determined that the immediate issue was an authentication misalignment between Dyspatch's services, which meant that they could not communicate with each other. We then learned that we had no way to generate new credentials without taking the services that manage our cluster offline. Once we determined that critical services had to be taken offline, we switched to testing on our staging environment to prevent data loss in our production environment. There we discovered that a previous update to a critical component of our infrastructure had been applied only to our staging environment.

Ultimately, a difference between our production and staging environments had knock-on effects that limited our ability to roll back and recover quickly.

What are we doing about it?

There are several actions we intend to take to prevent similar issues from happening:

We immediately aligned our staging and production environments so that infrastructure changes tested in staging behave the same way when applied to production. The root cause of this outage was a difference between the environments, and aligning them lets us test required infrastructure changes with confidence.
We plan to invest in tooling to help us automatically catch and audit any drift between our environments. Catching the difference beforehand would have prevented this incident.
We are investing in tooling and processes to help us rebuild our cluster more reliably and quickly. We had to spend time migrating changes from our staging environment to our production environment when trying to restore Dyspatch.
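
To illustrate the kind of drift check described above, here is a minimal sketch in Python. It is not Dyspatch's actual tooling: the function name `find_drift`, the component names, and the version strings are all hypothetical, and it assumes each environment's component versions can be exported as a simple mapping.

```python
from typing import Any

# Marker used when a component exists in only one environment (illustrative choice).
MISSING = "<absent>"


def find_drift(staging: dict[str, Any], production: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return each component whose value differs, as (staging, production)."""
    drift = {}
    for key in sorted(staging.keys() | production.keys()):
        in_staging = staging.get(key, MISSING)
        in_production = production.get(key, MISSING)
        if in_staging != in_production:
            drift[key] = (in_staging, in_production)
    return drift


# Hypothetical inventories; real ones would be exported from each environment.
staging = {"service-mesh": "1.20", "ingress": "2.4", "cert-manager": "1.14"}
production = {"service-mesh": "1.19", "ingress": "2.4"}

for component, (in_staging, in_production) in find_drift(staging, production).items():
    print(f"{component}: staging={in_staging} production={in_production}")
```

Run on a schedule or before any upgrade, a check along these lines would surface a change that exists in only one environment before it can affect a rollout.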

Summary

Finally, we want to apologize. We know Dyspatch is important for supporting our customers' communications. Your patience and support mean a great deal to us, and we appreciate everyone who reached out to our team. As with any operational issue, we will spend time in the coming days and weeks understanding the details of the event and making the improvements mentioned above to our infrastructure and processes.

Posted Apr 16, 2024 - 13:23 PDT

Resolved

This incident has been resolved.

Posted Apr 09, 2024 - 02:00 PDT

Monitoring

A fix has been implemented and we are monitoring the results. Thank you for your patience.

Posted Apr 09, 2024 - 00:58 PDT

Update