Count the joins

I enjoyed this post from Nicole Radziwill, PhD on LinkedIn:

How fragile are your pipelines? Start with this simple metric: COUNT THE JOINS. Every time you have to join, you’re making multiple assumptions about the underlying raw data, the biggest one being: you’re assuming it’s not going to change.

Not only does that increase the fragility, joins are costly. Particularly on modern data warehouses like Snowflake and BigQuery.

Nicole then goes on to say:

Without a modelled “clean data layer” that can stay invariant even when software engineers change the systems upon which they’re based, your organization will struggle with data integrity.

Which is what I argue for a lot, and what data contracts was originally designed to solve, by providing a well-defined interface to that data.