Dyspatch was unavailable from approximately 12:30 PM on April 8, 2024 until 1:00 AM on April 9 (Pacific time) due to an issue that occurred during a routine upgrade of Dyspatch's infrastructure. This post mortem analyzes the root causes of the outage, assesses its impact on our services, and outlines the steps Dyspatch is taking to prevent similar incidents in the future.
11:35 - We begin the upgrade
12:10 - The production cluster intermittently returns 503s for users. Dyspatch's services cannot communicate with each other.
12:17 - We attempt to roll back the changes.
12:30 - We identify the problem: the internal authentication mechanism our services use to communicate securely is out of sync across services.
12:30 - 17:30 - We try several strategies to bring production online.
17:30 - To avoid further impact to our production environment, work begins on our staging environment.
18:17 - We identify that previous changes were made to our staging environment without being applied to our production environment.
21:16 - Staging is online. We begin applying the changes from our staging environment to our production environment.
00:56 - Dyspatch is available again.
During the outage we ran into several challenges while trying to restore service. We quickly determined that the issue was an authentication mismatch between Dyspatch's services, which meant that they could not communicate with each other. We then learned that we had no way to generate new credentials without taking the services that manage our cluster offline. Once it was clear that critical services had to be taken offline, we switched to testing on our staging environment to prevent data loss in our production environment. There we discovered that a previous update to a critical component of our infrastructure had been applied only to our staging environment.
Ultimately, a difference between our production and staging environments had knock-on effects that limited our ability to roll back and recover quickly.
There are several actions we intend to take to prevent similar issues from happening:
Finally, we want to apologize. We know Dyspatch is important for supporting our customers' communications. Your patience and support mean a great deal to us, and we appreciate everyone who reached out to our team. As with any operational issue, we will spend time in the coming days and weeks understanding the details of the event and making the improvements mentioned above to our infrastructure and processes.