Data pipelines as dependable as software
Hey, hope you’ve had a great week! First, a couple of things from me.
The early bird pricing for my implementing data contracts workshop in Belgium this June ends on Monday! Would love to see you there :)
Secondly, the online talk I gave, Data Quality: Prevention is better than the cure, is now available to watch on YouTube.
Now on to today’s post, which is about what we would need to do to build data pipelines that are as dependable as software systems.
Data pipelines as dependable as software
If you’re building key product features on top of your data pipelines, shouldn’t they be as dependable as your software-backed features?
What would it take for your data pipelines to be that dependable?
Probably some or all of the following:
- Unit testing: The ability to test the logic of your pipelines locally, with known inputs matching expected outputs (sketched in code after this list).
- Integration testing: Deploying your pipeline to a representative environment and testing it end to end before it reaches production.
- Observability: Knowing what is going on at each stage in the pipeline, with alerts when something anomalous happens.
- Continuous deployment: Having enough confidence in your pre-prod testing that you would deploy your code as soon as it is merged.
- Immediate rollbacks: No matter what testing you do, not every issue will be caught. When there is a problem, you need to roll back quickly, either through a new deployment or by switching a feature flag.
- SLOs: Not just saying your pipelines are dependable, but being held accountable to the service-level objectives (SLOs) you publish.
- Trust, but verify inputs: Assume failures in your upstream dependencies and prevent them from impacting your service (also sketched in code after this list).
- Incident management: Follow a well-defined incident process, have runbooks for common tasks, etc.
These are all things we do well in software engineering.
And I don’t see any reason why we can’t do these well in data engineering too.
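To make the unit testing point concrete, here's a minimal sketch in Python. It assumes your transformation logic is factored out as a pure function over a DataFrame, so it can run locally with pandas and pytest; the function, columns and figures are all hypothetical, not a prescription.

```python
# transform.py -- a hypothetical pipeline step, factored out as a pure function
import pandas as pd

def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate completed orders into revenue per day."""
    completed = orders[orders["status"] == "completed"]
    return (
        completed.assign(day=completed["created_at"].dt.date)
        .groupby("day", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )

# test_transform.py -- runs locally with `pytest`, no warehouse or scheduler needed
def test_daily_revenue_ignores_incomplete_orders():
    orders = pd.DataFrame({
        "created_at": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02"]),
        "status": ["completed", "cancelled", "completed"],
        "amount": [10.0, 99.0, 5.0],
    })

    result = daily_revenue(orders)

    # Known inputs, expected outputs: the cancelled order is excluded.
    assert result["revenue"].tolist() == [10.0, 5.0]
```

The design choice that makes this possible is keeping the logic separate from the orchestration and I/O, so the test needs no warehouse, credentials or scheduler.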
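And for "trust, but verify inputs", a sketch of a guard that runs at the start of a pipeline step, again assuming pandas; the column names, checks and error type are illustrative assumptions.

```python
import pandas as pd

class BadInputError(Exception):
    """Raised when an upstream dataset fails basic checks."""

def verify_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Fail fast rather than let a bad upstream extract flow downstream."""
    required = {"created_at", "status", "amount"}
    missing = required - set(orders.columns)
    if missing:
        raise BadInputError(f"missing expected columns: {sorted(missing)}")
    if orders.empty:
        raise BadInputError("upstream extract is empty; refusing to overwrite the last good output")
    if orders["amount"].lt(0).any():
        raise BadInputError("negative order amounts found; quarantine this batch for review")
    return orders
```

Failing fast (or quarantining the batch) stops an upstream problem from silently propagating into the features built on top of your pipeline.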
Interesting links
The Fallacy of Data-Driven Strategy by Collin Prather
Interesting read. It’s data + your own understanding and creativity that creates a strategy, not just data on its own.
Semantic Data Model: The Blind Spot Holding Back Your AI Agent by Joseph Petty
On the importance of semantic models for AI agents. Also good definitions of ontologies, semantic data models, and knowledge graphs.
How we saved $3.5M in BigQuery costs on deleting inactive user data by Dominik Rys
My colleague at GoCardless wrote about 5 BigQuery optimisations we made to reduce the costs of carrying out deletions for inactive users.
Being punny 😅
5 ants rented an apartment with another 5 ants. Now they’re tenants.
Upcoming workshops
- Implementing a Data Mesh with Data Contracts - Antwerp, Belgium - June 5
- Alongside the inaugural Data Mesh Live conference, where I’ll also be speaking.
- Early Bird pricing available until April 30
- Sign up here
Thanks! If you’d like to support my work…
Thanks for reading this week's newsletter — always appreciated!
If you'd like to support my work, consider buying my book, Driving Data Quality with Data Contracts, or if you already have it, please leave a review on Amazon.
Enjoy your weekend.
Andrew