Make your data reliability problem visible

This is part 2 in my Data Reliability series. You can read the rest of the series on my website.

As I wrote yesterday, you have a data reliability problem, but you are the only one who knows it. You need to make this problem visible to the rest of the organisation, because once they can see it, you can get their help to solve it.

The best way to increase that visibility is to start treating data issues as incidents.

You might already have an incident process defined somewhere in your organisation, for example in software engineering. If you do, simply start using that whenever there are data issues.

If not, create a process that includes the following:

Raise an incident as soon as an issue is detected, and assess its severity in terms of potential business impact
Work with the relevant teams, including those whose work may have unknowingly caused the incident, to resolve the issue as quickly as possible
Provide regular updates to stakeholders so they know whether the data they are using is affected by the incident
Keep track of the investigation and the actions you took during the incident as you work to resolve it
Once an incident has been resolved, analyse it through a postmortem process to understand the root cause, capture the key takeaways, and take action to prevent similar failures in future

That might sound like a lot, but in practice it could be as simple as:

Create a Slack/Teams channel as soon as an issue is detected
Invite any relevant teams, for example the data producers, who might have context on why the issue has occurred
Invite stakeholders to the channel and post regular updates for them
After the incident, write up the key details and learnings in a document

The incidents themselves are a great way of increasing visibility, but they have other benefits too.

You’re bringing together people from different teams who all have an interest in this data. That includes the data consumers, who know best the value of this data for their day-to-day work, and the data producers, who have the most context on the data, how it’s generated, and how it changes over time.

These connections will be incredibly valuable when we start thinking about a solution to our data reliability problem.

We also start to get some great data on our data reliability, including:

How often we have data incidents
The datasets that are most often impacted
The most common root causes of data incidents
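If you record each incident in a consistent shape, these numbers are easy to pull out. Here is a minimal sketch in Python, assuming you keep a simple log with the date, the affected dataset, and the root cause from the postmortem. The `Incident` record, the `summarise` function, and the sample rows are all made up for illustration; adapt them to whatever your write-up document captures.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    # Hypothetical record shape: one row per incident, filled in from the postmortem.
    opened: date
    dataset: str
    root_cause: str


def summarise(incidents):
    """Count incidents per month, per dataset, and per root cause."""
    per_month = Counter(i.opened.strftime("%Y-%m") for i in incidents)
    per_dataset = Counter(i.dataset for i in incidents)
    per_cause = Counter(i.root_cause for i in incidents)
    return {
        # How often we have data incidents, in chronological order
        "per_month": sorted(per_month.items()),
        # The datasets that are most often impacted, most frequent first
        "datasets": per_dataset.most_common(),
        # The most common root causes, most frequent first
        "root_causes": per_cause.most_common(),
    }


# Illustrative sample data only.
incidents = [
    Incident(date(2024, 1, 8), "orders", "upstream schema change"),
    Incident(date(2024, 1, 22), "customers", "late delivery"),
    Incident(date(2024, 2, 5), "orders", "upstream schema change"),
]

summary = summarise(incidents)
print(summary)
```

A spreadsheet would do the same job. What matters is that every incident gets recorded with the same fields.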

Finally, we get data that tells us the cost of these data incidents.

You can track how many hours members of your team spend on incidents, multiply that by the average hourly cost of an employee on your team, and you have a monetary value you can assign to this problem.
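As a rough sketch of that arithmetic in Python: the `incident_cost` function, the roles, the hours, and the hourly cost below are all invented for illustration, so plug in your own tracked hours and your own cost figure.

```python
def incident_cost(hours_by_person, hourly_cost):
    """Total hours spent on an incident multiplied by an average hourly cost."""
    total_hours = sum(hours_by_person.values())
    return total_hours * hourly_cost


# Illustrative numbers only: hours each person logged against one incident.
hours = {"analyst": 6.0, "data engineer": 10.0, "analytics lead": 4.0}

# Assumed average hourly employee cost for the team, in your own currency.
cost = incident_cost(hours, hourly_cost=50.0)
print(cost)
```

Add this up across all the incidents in a quarter and you have a single figure you can put in front of the rest of the organisation.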

This will be useful data later, when we need to build a business case for solving this problem.

Once you’ve been doing this for a little while and have some data, you can start investigating the root causes of these incidents.