Learning From Failures: Leveraging Postmortems for Good
As engineers, we love solving problems and digging into the details. I know that I feel a particular sense of joy when shipping a system to production for the first time.
However, once a system goes into production, it's going to fail. That's not a knock on engineering; it's a fact of our industry. We cannot build a system with 100% uptime, no matter how much we plan ahead or how much thought we put into the design.
When this happens, our job is to fix the issue and bring the system back up.
A common mistake I see teams make is that they'll fix the problem but never dig into why it happened. Fast forward three months, and the team runs into the same problem, making the same mistakes. I'm always surprised when teams don't share their knowledge: if you've already paid the tax of learning from the first outage, why would you pay the same tax to learn the lesson again?
This is where having a postmortem meeting comes into play. Borrowing from medicine, a postmortem is a meeting where the team gets together to analyze the outage and figure out what could have been done differently to prevent it from happening. Other industries have similar mechanisms (for example, professional athletes review their games to learn where they made mistakes so that they can train differently).
One of the key concepts of the postmortem is that the goal isn't to assign blame, but to understand what happened and why. People aren't perfect, and it's not reasonable to expect them to be. This concept is so fundamental that another term you might hear for a postmortem is a blameless incident report (BIR).
Are you interested in learning how to run your own BIR process? Drop me a line and if there's enough interest, I'll write a follow-up post!
Getting Started
Every company has its own process for handling an outage, but a healthy process should be able to answer these five questions at a minimum.
- What stopped working?
- Why did it stop working?
- How did we fix it?
- What led to the system breaking?
- What are we doing to prevent this issue from happening in the future?
Organizations may have additional questions for their process, for example:
- What was the impact? (X customers were impacted for Y minutes)
- How was this discovered? (Was it reported by a user, did support find it, or did our monitoring tools page on-call?)
You can always make a process more complicated, but it's hard to simplify one, so my recommendation is to start with the five questions and then expand as your team evolves and matures.
What Stopped Working?
For there to be an outage, something had to stop working, right? So what was it? It's okay (and normal) to be a bit technical here; however, don't forget that this outage impacted our users, so we should strive to explain the outage in those terms.
Another way to think about this is "What stopped working and what impact did it have for our users?"
Here's a not-great answer to the question:
The Data Sync Java container stopped working.
While this does capture what stopped working, the details are quite vague. For example, did it not start? Did it start crashing? Was it the whole process that failed or only one part?
In addition, there's no clear delineation of what the user impact was. For example, could users not log into the application at all? Were there certain parts of the application that stopped working?
We can improve this by including a bit more detail about what specifically stopped working.
The Data Sync process was failing to connect to the UserHistory database.
Ah, okay, now we know that the sync process wasn't able to connect to a specific database and we can start getting a sense of why that would be a big deal. We're still not including the user impact, so let's add that bit of detail in.
The Data Sync process was failing to connect to the UserHistory database. As such, when users logged into their account, they could not see the latest transactions for their account.
Much better. Now I know that our users couldn't see recent transactions and that the Data Sync process was involved.
As a side benefit, if I'm a new engineer on this codebase or team, I now know more about the architecture and that this process is involved in showing users their transactions.
Why Did It Stop Working?
At some point, the system was working, and if it's no longer working, something had to have happened. So what was it? Was there a code deployment to production? Did a feature flag get toggled that had an adverse effect? What about infrastructure changes like DNS entries or firewall rules?
This is a critical step because if we don't know why it stopped working, then we don't have a good starting point when we dive into the circumstances that led to the outage.
This doesn't have to be a page's worth of technical deep dive; a couple of sentences can suffice. In our outage, the issue was due to a port being blocked by the firewall (where it wasn't before).
There was an infrastructure change for the database that blocked port 1433, which is the default port for the database. Because of this change, no application was able to successfully connect to the database.
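As a side note, if you're diagnosing this kind of failure, a quick connectivity check against the database port can confirm that the firewall is the culprit. Here's a minimal sketch in Java (since the Data Sync process in this example is a Java service); the hostname is hypothetical and just for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    public static void main(String[] args) {
        // Hypothetical hostname for the UserHistory database; replace with your own.
        String host = "userhistory-db.internal";
        int port = 1433; // the default database port referenced in the outage

        try (Socket socket = new Socket()) {
            // A short timeout keeps the check fast; a blocked port typically times out.
            socket.connect(new InetSocketAddress(host, port), 3_000);
            System.out.println("Port " + port + " on " + host + " is reachable.");
        } catch (IOException e) {
            System.out.println("Could not reach " + host + ":" + port + " - " + e.getMessage());
        }
    }
}
```

If the connect call times out while the database itself is healthy, a firewall or network change is a good place to start looking.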
How Did We Fix It?
If you've gotten to the point of writing the BIR, then you've fixed the issue and the system is up and running again, right? So what did you do to fix it? Did you roll back the deployment? Disable a feature flag? Burn down the application, change your name, and get a new job? This is a cool part of the BIR because you're leveling up others: if they run into a similar issue, they'll know how you were able to get back up and running.
In our example, we were able to unblock the port, so we can answer this question with:
Once we realized that port 1433 was being blocked by the firewall, we worked with the Infrastructure team to unblock the port. Once that change was completed, the Data Sync service was able to start syncing data to the UserHistory database.
What Led To the System Breaking?
This is where the meat of the conversation should take place. In a healthy organization, we assume that people want to do the right thing (if not, you have a much bigger problem than BIRs). So we've got to figure out how we got here: what were the motivations, and why did we do the things that we did?
One common approach is the 5 Whys, made popular by the Toyota Production System. The idea is that we keep digging into why something happened instead of stopping at symptoms.
An example 5 Whys breakdown for this outage could be the following:
- Why couldn't users see their latest transactions? Because the Data Sync process couldn't connect to the UserHistory database.
- Why couldn't the Data Sync process connect to the database? Because a firewall change blocked port 1433.
- Why did the firewall change block port 1433? Because the change was made to lock down access to the database instance, and it didn't account for the Data Sync process needing that port.
- Why wasn't the Data Sync process accounted for? Because there wasn't an up-to-date environment or architecture diagram showing which services depend on the UserHistory database.
- Why didn't anything alert us when the sync started failing? Because there's no monitoring on the Data Sync process.
As you can see, this approach brings up lots of questions, including the motivation behind the work and why the team was doing it in the first place. It is possible that your chain of whys ends up shorter or longer than five; the number isn't magic, the point is to keep asking until you reach causes you can actually act on.
What Are We Doing To Prevent Similar Issues in the Future?
The system is going to have an outage; that's not up for debate. However, it would be foolish to have an outage and not do anything to prevent it from happening in the future. If we already paid the learning tax once, let's not pay it again for the same issue.
This should be a concrete list of action items that the team takes ownership of. In some cases, it's work the team can do to prevent the issue going forward; in others, it could be working with other teams to help them fix things in their process.
In our hypothetical outage, we could have the following action items:
- Add additional monitoring for the Data Sync process (see the sketch after this list)
- Work with Security to determine approach for securing our database instance
- Create environment diagram for Data Sync process
- Create architecture diagram for Data Sync process
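To make the first action item a bit more concrete, here's a rough sketch of what a basic connectivity monitor for the Data Sync process could look like. This isn't a prescribed implementation; the hostname is hypothetical, and in a real setup the failure branch would emit a metric or page on-call instead of printing to the console.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DataSyncConnectivityMonitor {
    // Hypothetical values; wire these up to your own config and alerting.
    private static final String DB_HOST = "userhistory-db.internal";
    private static final int DB_PORT = 1433;

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Check connectivity once a minute.
        scheduler.scheduleAtFixedRate(DataSyncConnectivityMonitor::checkDatabase, 0, 1, TimeUnit.MINUTES);
    }

    private static void checkDatabase() {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(DB_HOST, DB_PORT), 3_000);
            System.out.println("UserHistory database is reachable.");
        } catch (IOException e) {
            // In a real setup this would page on-call or emit a metric, not just print.
            System.err.println("ALERT: Data Sync cannot reach the UserHistory database: " + e.getMessage());
        }
    }
}
```

Even a check this simple would surface a blocked port within a minute of the change instead of waiting for the impact to show up for users.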
Example Blameless Incident Report
In this post, we covered the five main questions to answer in a BIR and what good responses look like. In this section, I wanted to go over an example BIR for the database issue that we've been exploring. As you'll see, it's not a verbose document; however, it does capture the main points and is easily consumable for other teams to learn from our mistakes.