Performing a root cause analysis
This is part 3 in my Data Reliability series. You can read the rest of the series on my website.
On Monday we described our data reliability problem. Yesterday we made that visible and collected some data about it. Today we’re going to understand exactly what is causing this problem.
To do that, we are going to perform a root cause analysis.
A root cause analysis is a process for identifying the fundamental cause(s) of a problem. It is used in many industries for many different types of problems, and is equally useful in helping us identify the cause(s) of our data reliability problem.
There are a few techniques we can use for a root cause analysis, including the 5 whys. The one I’m going to use is called a fishbone diagram.
The way it works is:
- You put the problem statement at the head of the fish
- From the backbone you add a bone for each major category of causes
- For each category, brainstorm potential causes that could contribute to the problem, and add them as bones from the category line
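If you would rather keep the diagram as text, the structure above can be sketched in a few lines of Python. This is only an illustration: the `Fishbone` class, its method names, and the placeholder problem, categories, and causes are all made up for this example, so swap in what you have actually seen in your own incidents.

```python
from dataclasses import dataclass, field


@dataclass
class Fishbone:
    """A fishbone diagram: one problem, with causes grouped by category."""

    problem: str  # the head of the fish
    categories: dict[str, list[str]] = field(default_factory=dict)

    def add_cause(self, category: str, cause: str) -> None:
        # Create the category bone on first use, then hang the cause off it.
        self.categories.setdefault(category, []).append(cause)

    def render(self) -> str:
        # Indented text outline: problem, then categories, then causes.
        lines = [f"Problem: {self.problem}"]
        for category, causes in self.categories.items():
            lines.append(f"  {category}")
            for cause in causes:
                lines.append(f"    - {cause}")
        return "\n".join(lines)


# Placeholder content only (hypothetical, not taken from a real analysis).
diagram = Fishbone(problem="Placeholder problem statement")
diagram.add_cause("Process", "placeholder cause A")
diagram.add_cause("Process", "placeholder cause B")
diagram.add_cause("Tooling", "placeholder cause C")
print(diagram.render())
```

Keeping the diagram as plain data like this makes it easy to paste into an incident channel and let other people add their own causes.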
For example, you’ll likely find that one of your most common root causes for data incidents is an upstream breaking schema change, where some data you relied on had its schema changed in an incompatible way (for example, removing a field you depended on) that broke your data application.
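To make "incompatible" concrete, here is a minimal sketch of how you might compare two versions of a schema. It assumes a schema can be represented as a simple mapping of field name to type name, which is a simplification. The function names and the example fields (`order_id`, `amount`, `customer_email`) are hypothetical.

```python
def removed_fields(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """Fields present in the old schema but missing from the new one."""
    return sorted(set(old_schema) - set(new_schema))


def retyped_fields(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """Fields present in both schemas whose type has changed."""
    shared = old_schema.keys() & new_schema.keys()
    return sorted(name for name in shared if old_schema[name] != new_schema[name])


# Hypothetical schemas, for illustration only.
old = {"order_id": "string", "amount": "float", "customer_email": "string"}
new = {"order_id": "string", "amount": "string"}

print("Removed:", removed_fields(old, new))
print("Retyped:", retyped_fields(old, new))
```

A removed or retyped field that your application reads is the kind of change that breaks it. A newly added field usually is not, which is why this sketch ignores additions.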
If we use a fishbone diagram to do a root cause analysis on an upstream breaking schema change, we get something that looks like this:
(If you want the fish-like version, click here.)
Don’t just use mine though!
Perform this analysis yourself, based on the root causes you’ve seen from your data incidents.
Get people outside of your team involved too. The stakeholders, the data consumers, and the data producers. All those people who you’ve been adding to your incident channels.
Not only does that help you keep building those relationships, but they will undoubtedly have great input, and they will learn more about how things look from your point of view.
When you’re done, you will likely have something similar to mine: our data is different, but our problems are often the same. The process itself is still important, though.
With this deep understanding of the problem and its root causes, we can now start thinking about solutions. More on that tomorrow.