Embracing Blameless Analysis in Engineering: A Game-Changer for Team Dynamics and System Reliability

Feb 22, 2024
Jody Alford

Can you really be a software engineer if you haven’t ever made a mistake in the code base? Recently, NSS Software Engineering Instructor Jody Alford shared his perspective on the importance of blameless analysis with our Seekers, recent grads who are still searching for their first job in tech. As humans, we all make mistakes, and we know computers aren’t perfect either – hello, generative AI. In fact, Jody recalls making a mistake that cost his company more than his salary. 👀 But if we shift from blaming people to blameless analysis, we can improve team dynamics and even system reliability.

Blameless analysis allows us to look at an incident objectively, discover the root cause, and identify solutions to prevent it from happening again – all without assigning personal blame. But before we can understand blameless analysis, we need to understand blame and how it impacts people and analysis.

Understanding Blame in Engineering

So what is blame? Blame assigns fault or responsibility to a person based on the actions they have (or have not) performed. In an engineering context, this action could be one that impacts users, such as bringing your system down or reducing the effectiveness of your website. It could be in the form of data loss: your database table gets dropped, you delete a bunch of rows, you drop the entire database by accident, or you lose information in transit. It could be something that requires a production intervention, such as a team needing to make changes to software that has been deployed to production as a result of some event or failure. Blame can also arise when there is slow issue resolution, monitoring failures, or development delays.

When someone makes a mistake, we feel the need to point it out and have them change their behavior. But we also sometimes conflate the person with their actions, which often shows up as “could have” or “should have” language. These statements are full of blame.

“By blaming you, I’m denying your humanity.” - Dave Rensin, Distinguished Engineer at Google

We come together as software engineers to solve hard problems. That collaboration allows us to create these grand software projects, and with any sufficiently complex system, errors are going to happen. Eventually, with enough involvement, you will likely be responsible for one of those errors. By removing blame from the analytical context when we are trying to resolve an issue, we are accepting that we too are just as capable of error as our teammates.

Blame Creates Risk

Being blamed doesn’t feel good. So when we have blameful environments where everyone is focused on how Jody made that mistake and on getting Jody to do things correctly, it creates a reluctance to admit mistakes. That reluctance can make issues take longer to surface and harder to resolve. It reduces the incentive to innovate and stifles learning. We don’t take as many chances if errors bring strong repercussions or blame.

We are software engineers. We’re innovators. We want to create new things. Taking risks is a part of our job.

So how do you analyze something blamelessly?

The Shift to Blameless Analysis

For blameless analysis to be effective, we have to start with the premise that everybody did their best with the information they had at the time. Whether or not this statement is true is not relevant. Our goal is to apply this method to expose system vulnerabilities and identify changes we can make – and hopefully fix the issue so that the same person can’t make the same error again.

“The unexpected nature of failure naturally leads humans to react in ways that interfere with our understanding of it.” - PagerDuty

It’s important to note what blameless analysis is not.

  • It’s not a mechanism to avoid accountability.
  • It’s not to be applied to every encounter.
  • It’s not an acceptance of the issue being addressed.
  • It’s not designated to make those involved feel better (that’s just a side benefit).
  • It’s not an opportunity to apply pressure to an individual or team.
  • It’s not to be underestimated.

We’re trying to identify problems in the system so that we can correct them. This is an analysis tool, not a philosophy. We still want to make changes and improvements. Our goal with blameless analysis is system improvement.

Sometimes blame is warranted, but unless that person is your direct report, you likely won’t have the authority to do anything about it. That’s a problem for your team’s leadership or human resources. Blamelessness is just a hat that you need to be able to put on when the situation calls for it.

What does Blameless Analysis look like?

Blameless analysis is about information discovery and sharing; it provides constructive feedback, and it is human-centric. Everyone involved is working with the same information. Causes are identified and shared without fear of retaliation or blame. The intent is to understand why it made sense for a person to behave the way that they did, with the assumption that they believed they were supposed to behave that way.

Since the focus is on improving a system, blameless analysis gives us a mechanism for constructive feedback. The scope of that feedback is limited to just the issue we’re trying to resolve. These systems that we are designing are complex. Many systems that I’ve worked on have had more code, more individual projects, than any one person could really understand as a whole. And yet we had a team of a hundred people working on it, spread across pieces that all depended on one another. This complexity is a problem. It’s more than our brains can handle in many situations. So accepting that failure is human keeps this tool human-centric, and using it helps us work around that complexity and stay focused on making improvements.

“When it comes to decisions and actions, we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense. Otherwise, they wouldn’t have done it.” - John Allspaw

Tools for Blameless Analysis

Be Aware of Cognitive Bias

Being aware of our cognitive biases is one of the simplest ways we can offset blame or eliminate it from our vocabulary. We have to recognize that we have mental traps we can fall into, such as the fundamental attribution error: the idea that what people do reflects their character. For example, this person is tardy, so he must be the type of person who’s just late all the time. But we don’t know their circumstances; we don’t know why they’re late in that particular instance. So be careful about asserting that the behavior we’re seeing now is a trend that will continue because of this person’s character. We can counteract this bias by focusing on the situational causes rather than the actions of the individual.

Likewise, we want to avoid confirmation bias. This is the tendency to say, I knew this was gonna happen. I already had this belief that the system was unstable and that manually going in and updating a configuration file was going to cause a problem. And voila, here it is! I was right. We can combat this by having someone play devil’s advocate during our analysis and ask questions like: Did you actually know that? Was this information available to us? Let’s go back and look at the logs – could we have known that beforehand?

Additional biases include hindsight and negativity biases. Hindsight bias makes the incident look inevitable because we already know the outcome, even when there was little or no objective basis for predicting it. This can be countered with a timeline analysis that starts before the incident and works forward, instead of backwards from the resolution. Negativity bias assumes that things will continue to go bad, simply because they already are bad. You can combat this thinking by reframing the incident as a learning opportunity. That is easier said than done, but every failure in our system is a highlighter on an improvement we can make.

The 5 Whys and the Infinite Hows 

The "5 Whys" technique originated from the founder of Toyota, Sakichi Toyoda. This method involves repeatedly asking 'why' to uncover the underlying cause of an issue. For instance, when a problem arises on the factory line, one asks why it occurred, followed by subsequent 'why' questions for each answer received. Usually by the fifth “why,” you’ll be close to the root cause of the issue. 

However, the technique is not without its criticisms. Some, like John Allspaw, argue that it oversimplifies complex issues by implying a single root cause. To counter this, shifting the question from 'why' to 'how' (the Infinite Hows method) can provide a more nuanced understanding, acknowledging the possibility of multiple contributing factors to an issue.
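
To make the contrast concrete, here is a minimal sketch in Python of what the two questioning styles might produce. The incident, the questions, and the answers are all invented for illustration – they are not from Jody’s talk or any real outage.

# A hypothetical incident: a checkout page started returning errors.
# Everything below is invented to illustrate the shape of the two techniques.

# The "5 Whys": each answer becomes the subject of the next "why".
five_whys = [
    ("Why did the checkout page return errors?",
     "The orders service could not reach its database."),
    ("Why could the service not reach its database?",
     "The connection pool was exhausted."),
    ("Why was the connection pool exhausted?",
     "A new reporting query held connections open far longer than expected."),
    ("Why did the query hold connections so long?",
     "It was missing an index and scanned the whole table."),
    ("Why wasn't the missing index caught before deploy?",
     "Nothing in the release process reviews query performance."),
]

# The "Infinite Hows": open-ended questions that invite multiple
# contributing factors instead of a single root cause.
infinite_hows = [
    "How did the query reach production without a performance review?",
    "How were the connection pool limits chosen, and how are they monitored?",
    "How could the exhaustion have been detected before users noticed it?",
]

for why, answer in five_whys:
    print(f"{why}\n  -> {answer}")
print()
for how in infinite_hows:
    print(how)

Notice that the whys terminate in a single answer, while each how opens a different avenue for system improvement – which is exactly the nuance the criticism above is pointing at.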

In reality, both of these techniques can fail if you don’t approach the questions blamelessly.

Additional Tips for Blameless Analysis

  • Stick to the facts. Opinions are distracting.
  • Be willing to discuss a person’s behavior without applying a label.
  • Speak to specific instances of behavior rather than general trends.
  • Do not make implications about a person’s character based on their errors.
  • Accept feedback with the understanding that it is a necessary step in improvement.
  • No ego. We are here to solve problems with code.
  • Postmortem analysis: Investigate and understand the causes of a past issue (a minimal write-up sketch follows this list).
  • Pre-mortem analysis: Visualize, with your team, how the system can fail and account for those failures.
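
For the postmortem item above, here is a minimal sketch of the sections a blameless write-up often covers, again in Python and again using an invented incident. The field names and contents are illustrative only, not a fixed standard; the important property is that no section names or faults an individual.

# A minimal, hypothetical blameless postmortem outline.
# Section names and contents are illustrative only.
postmortem = {
    "summary": "Checkout errors for roughly 40 minutes; no data loss.",
    "impact": "About 3% of checkout attempts failed during the window.",
    "timeline": [
        "14:02 - reporting query deployed",
        "14:18 - connection pool exhausted, first alerts fire",
        "14:41 - query rolled back, service recovers",
    ],
    "contributing_factors": [
        "No automated review of query performance before deploy",
        "Pool saturation alerts were set too high to page anyone in time",
    ],
    "action_items": [
        "Add a query-performance check to the deploy pipeline",
        "Lower the pool saturation alert threshold",
    ],
    # Deliberately absent: any field that names or faults an individual.
}

for section, content in postmortem.items():
    print(section.upper())
    print(content if isinstance(content, str) else "\n".join(content))
    print()

A pre-mortem can reuse the same outline before anything has failed: the team imagines the incident, fills in the timeline and contributing factors they fear most, and turns them into action items in advance.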

“When you look to blame humans, you don’t learn anything. Because [the unlucky humans that were in that unfortunate position at the time] don’t raise their hands and tell you the truth.” - Dave Rensin, Distinguished Engineer at Google

Blameless analysis is an industry-standard practice that can transform how teams approach errors and system failures. It's about creating an environment where learning from mistakes is prioritized over finding fault, where innovation is encouraged, and where systemic improvements are continually sought. By embracing blameless analysis, engineering teams can improve not only their systems but also their collaborative dynamics, leading to more effective and resilient solutions.

Topics: Learning, Web Development, Software Engineering