Data incidents
No matter what we do, when working at sufficient scale or speed it's inevitable that things will go wrong. This is well accepted in software engineering, and it's just as true for data.
As the team at incident.io wrote recently, we need to start treating our data incidents the way good software engineering teams treat theirs. That means having a well-defined process to follow while the incident is occurring, to ensure good communication and aid a speedy resolution, and a postmortem process to discuss what went wrong, why, and how to prevent a similar incident in the future.
This postmortem process is particularly useful for us data folks when dealing with upstream issues that break our pipelines. We can use it to start conversations with our data generators, show them the impact of those issues, and get them involved in the potential solutions. It builds relationships between those who generate data and those who consume it.
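To make that concrete, here's a minimal sketch (my own illustration, not anything from the incident.io post) of the kind of check that turns an upstream change into a visible incident rather than a silently broken pipeline. The field names and expected schema are entirely hypothetical.

```python
# Hypothetical expected schema for records arriving from an upstream team.
EXPECTED_SCHEMA = {
    "order_id": str,
    "amount": float,
    "created_at": str,
}

def check_record(record: dict) -> list[str]:
    """Return a list of schema violations for one upstream record."""
    violations = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"wrong type for {field}: got {type(record[field]).__name__}"
            )
    return violations

# e.g. the upstream team renames amount -> total_amount without telling us
record = {"order_id": "o-123", "total_amount": 42.0, "created_at": "2023-01-01"}
problems = check_record(record)
if problems:
    # In practice this is where you'd open an incident and page someone,
    # rather than letting downstream models fail hours later.
    print("Data incident:", problems)
```

The check itself is trivial; the point is that it gives the postmortem something specific to put in front of the data generators.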
We also start to build up data on what the most common incidents are and what causes them, which helps us prioritise solutions that fix those issues at source.
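As a toy sketch of what that might look like, assuming each postmortem is recorded with a root-cause label (the records and labels below are made up for illustration):

```python
from collections import Counter

# Made-up postmortem records; in practice these would come from wherever
# your incident tooling stores its postmortems.
postmortems = [
    {"id": "INC-1", "root_cause": "upstream schema change"},
    {"id": "INC-2", "root_cause": "late-arriving data"},
    {"id": "INC-3", "root_cause": "upstream schema change"},
]

# Tally root causes so the most frequent ones rise to the top of the backlog.
for cause, count in Counter(p["root_cause"] for p in postmortems).most_common():
    print(f"{count}x {cause}")
# => 2x upstream schema change   <- the strongest case for a fix at source
```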
P.S. I’ll be appearing on the excellent Catalog & Cocktails podcast tomorrow for a no-bs chat about data contracts! You can join the live stream on LinkedIn at 10pm GMT/4pm CT.