2024
I wrote earlier this month that data contracts shouldn’t focus on enforcement.
By which I meant, the outcome you’re optimising for isn’t enforcing rules on someone’s data, but using data contracts to facilitate a better quality dataset that others can build on with confidence.
What do you want from your data?
Do you want it to be fast changing?
You try your best to work around the poor quality data you’re given.
Only to deliver a poor outcome to your users.
I enjoyed this post from Nicole Radziwill, PhD on LinkedIn:
How fragile are your pipelines? Start with this simple metric: COUNT THE JOINS. Every time you have to join, you’re making multiple assumptions about the underlying raw data, the biggest one being: you’re assuming it’s not going to change.
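As a rough illustration of that metric, here is a minimal sketch that counts the JOIN keywords in a SQL string. The function name `count_joins` and the sample query are hypothetical, and it assumes a plain keyword count is good enough: it does not parse the SQL, so the word "join" inside a comment or string literal would also be counted.

```python
import re

# Matches the JOIN keyword as a whole word, case-insensitively.
JOIN_PATTERN = re.compile(r"\bjoin\b", re.IGNORECASE)


def count_joins(sql: str) -> int:
    """Return a rough count of JOIN keywords in a SQL string.

    Assumption: no real SQL parsing is done, so comments and string
    literals containing the word "join" are counted too.
    """
    return len(JOIN_PATTERN.findall(sql))


# Hypothetical query, used only to illustrate the count.
query = """
SELECT o.id, c.name, p.title
FROM orders o
JOIN customers c ON c.id = o.customer_id
LEFT JOIN products p ON p.id = o.product_id
"""

print(count_joins(query))
```

Each join counted this way is one more place where a change in the upstream data could break the pipeline, so a higher number suggests more assumptions worth checking with the producer.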
If you’re a software engineer, and an upstream dependency is unreliable, then you would speak to the team who owns that dependency.
If you want to improve the quality of the data,
then you’ll need to speak to the producer of the data.
Staging layers, medallion architectures, data testing, assigning data stewards, gatekeeping application changes until reviewed by a data team.
Data quality can only be improved at the source.
If the source of the data isn’t capturing the data at the required accuracy, there’s nothing you can do later to increase the accuracy.
As I wrote yesterday, many data professionals don’t trust the data they are building on. And many users of data and data applications don’t trust the data they’re being provided.