Incidents and Postmortems at the Management Level

Not only do systems fail

Marcelo Oikawa
May 20, 2022
Photo by Paul Hanaoka on Unsplash

A colleague went on vacation. We were launching a sensitive internal tool to a few external users, aiming for a gradual rollout. Our colleague wrote a detailed handover document for another team member to ensure that everything went well. Everything seemed fine. Then, two days before the launch, we found out that we had requested permissions for users who weren’t part of the test phase. In addition, our colleague had to support us during their vacation, and, in the end, we delayed the launch by two days.

Trust me, stories like this are more common than you can imagine in a dynamic tech company. But how can we avoid the same problem in the near future?

A failure is only a “failure” if you don’t learn from it

When I was an engineer, I learned that companies that have adopted a Postmortem Culture are more willing to build solid and reliable systems. Could we also use the Incident/Postmortem process at the management level, that is, for problems unrelated to systems? I’d say, “yes, we can.”

Incidents and Postmortems in a nutshell

What defines an incident? Well, it depends. When we talk about systems, it can be the number of errors, a breached SLA, service unavailability, etc. In my opinion, what defines an incident is that users are being affected.

The Postmortem process brings people together to discuss the details of an incident: why it happened, its impact, what actions were taken to mitigate and resolve it, and what should be done to prevent it from happening again. Another essential point is that it should be collaborative and blameless, which means keeping it constructive without pointing fingers.

“Removing blame from a Postmortem gives people the confidence to escalate issues without fear” — John Lunney and Sue Lueder

Problems will always occur, whether in systems or processes. It’s not people’s fault. The main takeaway is that we must learn from them when they happen.
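As a rough illustration, the elements of a Postmortem described above could be captured in a small record. This is only a sketch under my own assumptions: the `Postmortem` class, its field names, and the helper methods are hypothetical, not a standard format, and the sample values simply restate the handover story from the beginning of this post.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record of one incident review (field names are illustrative)."""

    title: str
    what_happened: str
    impact: str
    root_causes: list[str] = field(default_factory=list)
    mitigation_actions: list[str] = field(default_factory=list)
    prevention_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def to_text(self) -> str:
        """Render the record as plain text that is easy to share."""
        lines = [self.title]
        for name, value in vars(self).items():
            if name == "title":
                continue
            heading = name.replace("_", " ").capitalize()
            items = value if isinstance(value, list) else [value]
            items = [item for item in items if item]
            body = "; ".join(items) if items else "(to be filled in)"
            lines.append(f"{heading}: {body}")
        return "\n".join(lines)


# Sample values restate the handover incident; the root cause is left open
# on purpose, since finding it is the job of the investigation.
handover = Postmortem(
    title="Launch delayed after a vacation handover",
    what_happened="Permissions were requested for users outside the test phase.",
    impact="The launch was delayed by two days.",
    mitigation_actions=["The colleague supported the team during their vacation."],
    prevention_actions=[
        "Make more than one person accountable for a handover.",
        "Have the accountable person rehearse the documented steps beforehand.",
    ],
)

print(handover.missing_sections())
print(handover.to_text())
```

Note that the record has no field for who was at fault, which fits the blameless approach. The `missing_sections` check is just one way to notice that a review is not finished yet.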

Why am I suggesting using it at the management level?

Honestly, I think it’s a magnificent way to explain a problem from beginning to solution, including its impact and the lessons learned. Not to mention, it’s also scalable because it’s easy to share across the whole company. The right question should be, “why not?”

You can think of many other examples where a Postmortem could be applied to help the company grow by learning. I remember one Postmortem that I did at my previous company when we fired an employee within their first six months due to performance misalignments. Why? A deeper investigation showed an issue in their hiring process. Several interviewers had said “no hire” until one said “hire” without seeing the previous reviews. We adjusted the process to avoid “infinite retries” among the interviewers, and we introduced metrics to identify when interviewers should be re-trained. Again, no system was involved in this example. We simply changed the process to avoid the same mistake.

In the example at the beginning of this post, the main lesson was to make more than one person accountable for the job in a handover, something similar to the four-eyes principle. The accountable person should also walk through the documentation and perform the documented steps themselves, simulating the real scenario before the official transition of responsibilities. This is an excellent way to verify whether the documentation is clear enough.

Remember that a Postmortem is not only the document per se. It’s the whole process, which starts with investigating the incident’s root cause. The document format may vary. Please adjust it to your needs and keep in mind that it should not be tedious. Even if we spend some time on it, the objective must be to make readers aware of the lessons so that the same impacts are avoided. Nobody wants to see managers making the same mistakes over and over.
