
Data pipelines as dependable as software


Hey, hope you’ve had a great week! First, a couple of things from me.

Early-bird pricing for my Implementing Data Contracts workshop in Belgium this June ends on Monday! I'd love to see you there :)

Secondly, the online talk I gave, Data Quality: Prevention is better than the cure, is now available to watch on YouTube.

Now on to today’s post, which is about what we would need to do to build data pipelines that are as dependable as software systems.


Data pipelines as dependable as software

If you’re building key product features on top of your data pipelines, shouldn’t they be as dependable as your software-backed features?

What would it take for your data pipelines to be that dependable?

Probably some or all of the following:

  • Unit testing: The ability to test the logic of your pipelines locally, with known inputs and expected outputs (see the first sketch after this list).
  • Integration testing: Deploying your pipeline in a representative environment and testing it end-to-end.
  • Observability: Knowing what is going on at each stage of the pipeline, with alerts when something anomalous happens (see the second sketch after this list).
  • Continuous deployment: Having enough confidence in your pre-prod testing that you deploy your code as soon as it is merged.
  • Immediate rollbacks: No matter what testing you do, not every issue will be caught. When there is a problem you need to roll back quickly, either through a new deployment or by switching a feature flag.
  • SLOs: Not just saying your pipelines are dependable, but being held accountable to the SLOs you provide.
  • Trust, but verify inputs: Assume failures in your upstream dependencies and prevent them from impacting your service (see the third sketch after this list).
  • Incident management: Follow a well-defined incident process, have runbooks for common tasks, and so on.
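
To make the unit testing point concrete, here is a minimal sketch in Python. The dedupe_latest transformation and its field names are made up for illustration; the idea is that if your pipeline logic lives in pure functions over plain data structures, you can test it locally with pytest using known inputs and expected outputs, just like any other code.

    def dedupe_latest(events: list[dict]) -> list[dict]:
        """Keep only the latest event per user_id, using updated_at."""
        latest: dict[str, dict] = {}
        for event in events:
            current = latest.get(event["user_id"])
            if current is None or event["updated_at"] > current["updated_at"]:
                latest[event["user_id"]] = event
        return sorted(latest.values(), key=lambda e: e["user_id"])


    def test_dedupe_latest_keeps_most_recent_event():
        events = [
            {"user_id": "a", "updated_at": "2024-01-01", "plan": "free"},
            {"user_id": "a", "updated_at": "2024-02-01", "plan": "pro"},
            {"user_id": "b", "updated_at": "2024-01-15", "plan": "free"},
        ]
        assert dedupe_latest(events) == [
            {"user_id": "a", "updated_at": "2024-02-01", "plan": "pro"},
            {"user_id": "b", "updated_at": "2024-01-15", "plan": "free"},
        ]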
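
For observability, here is a sketch of the kind of check that can run after each stage. The baseline window, the 50% threshold, and the send_alert hook are all assumptions for illustration; in practice the alert would go to whatever monitoring or paging tool you already use.

    from statistics import mean


    def send_alert(message: str) -> None:
        # Placeholder: wire this up to your alerting tool of choice.
        print(f"ALERT: {message}")


    def check_row_count(stage: str, todays_count: int, recent_counts: list[int],
                        max_deviation: float = 0.5) -> None:
        # Alert if today's row count deviates too far from the recent average.
        baseline = mean(recent_counts)
        if baseline == 0 or abs(todays_count - baseline) / baseline > max_deviation:
            send_alert(f"{stage}: row count {todays_count} is more than "
                       f"{max_deviation:.0%} away from the recent average of {baseline:.0f}")


    # Example: a sudden drop in rows at this stage triggers an alert.
    check_row_count("orders_daily", todays_count=120, recent_counts=[1000, 980, 1050])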
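
And for trust, but verify inputs, a sketch of validating an upstream feed before using it. The required fields and the 5% failure threshold are hypothetical; the point is to catch broken upstream data at the boundary rather than let it flow downstream.

    REQUIRED_FIELDS = {"user_id", "event_type", "occurred_at"}


    def validate_rows(rows: list[dict]) -> list[dict]:
        # Fail fast at the boundary instead of silently propagating bad data.
        if not rows:
            raise ValueError("Upstream feed is empty - refusing to overwrite downstream data")
        valid = [r for r in rows if REQUIRED_FIELDS <= r.keys() and r["user_id"]]
        failed = len(rows) - len(valid)
        if failed > 0.05 * len(rows):
            raise ValueError(f"{failed} of {len(rows)} rows failed validation")
        return valid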

These are all things we do well in software engineering.

And I don’t see any reason why we can’t do these well in data engineering too.


The Fallacy of Data-Driven Strategy by Collin Prather

Interesting read. It’s data + your own understanding and creativity that creates a strategy, not just data on its own.

Semantic Data Model: The Blind Spot Holding Back Your AI Agent by Joseph Petty

On the importance of semantic models for AI agents. It also includes good definitions of ontologies, semantic data models, and knowledge graphs.

How we saved $3.5M in BigQuery costs on deleting inactive user data by Dominik Rys

My colleague at GoCardless wrote about 5 BigQuery optimisations we made to reduce the cost of deleting inactive users' data.


Being punny 😅

5 ants rented an apartment with another 5 ants. Now they’re tenants.


Upcoming workshops

  • Implementing a Data Mesh with Data Contracts - Antwerp, Belgium - June 5

Thanks! If you’d like to support my work…

Thanks for reading this week's newsletter — always appreciated!

If you’d like to support my work consider buying my book, Driving Data Quality with Data Contracts, or if you have it already please leave a review on Amazon.

Enjoy your weekend.

Andrew

