The quality of a data product is its data and its code
Hello 👋
Today I write about how the quality of a data product is determined not just by its data, but by the code behind it too.
There are also links to articles on building an LLM platform, speed when decentralising, and building decision systems.
The quality of a data product is its data and its code
Often when we think about the quality of a data product, we think only about the data quality, such as accuracy, completeness, validity, etc.
Sometimes we’ll also track its timeliness, performance, and other SLOs.
These are all important, but they only track the quality of the data product in production. They are reactive metrics, telling us when something has already gone wrong.
To be confident in the quality of a data product we also need to monitor the quality of the code behind the data product.

These quality metrics detect potential issues such as technical debt, security vulnerabilities, and maintainability problems, all of which could lead to production failures. Crucially, they do so earlier and at lower cost than waiting for an issue to hit production, impact your users, and corrupt your data.
This is already common in software engineering, where service catalogs are used to monitor the quality of code across an organisation. Engineering leaders can then see where investment is needed to make improvements before issues start impacting performance.
The same is true for data products, which will usually have some code behind them, e.g. dbt pipelines, Spark jobs, Python workflows, etc.
Like the code for services, this is all just software sitting in a Git repository somewhere, and it can be monitored in the same way, perhaps by simply using the same service catalog your engineering org uses. The checks might be different: for a data product you might want to check for an up-to-date dbt version, correctly configured CI jobs, and clear and complete documentation. But the premise is the same.
This, along with your existing data quality checks, gives you a complete picture of the quality of your data products.
Interesting links
The model is the easy part: Building the LLM Platform at Whatnot by Stas Sajin, Faithful Alabi, Peiyun Zhang, Peicheng Yu
Great article both on building an LLM platform, and how to think about building platforms in general.
I also liked how they could leverage their existing platform capabilities:
A big reason we were able to build the platform quickly, and with a small team, is that Whatnot had already spent years investing in the foundations underneath it. This was the result of earlier platform work across the company: building the modern data stack, scaling investments, and strengthening shared tooling and other internal platform primitives. Those investments meant integrations, logging patterns, analytics sinks, and internal tooling foundations were already there when we needed them.
See also my post on the iPhone model for integrated platforms.
Speed Is a Design Choice, Not a Property of Decentralisation by Bjørn Broum
Another great read from Bjørn.
The question is not whether to decentralise. It is whether the organisation is willing to do what decentralisation actually demands: design the shared conditions for coherent independence before distributing the work, not after the debt has compounded beyond the point where it can be paid cheaply.
I Thought I Was Just A Data Engineer. I Was Building Decision Systems. by Sweta Mankala
Decision systems are the key, whether that decision is made by a human or an AI.
From my blog
I made a small but nice improvement to my terminal recently, where each project (git repo) now has its own background colour in the terminal. It’s an effective context signal telling me which project I’m currently in as I’m bouncing around terminal windows, and might be of interest to you too!
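If you want a feel for how such a setup can work, here is a small sketch of one way to do it. The hashing scheme and colour range are my own illustration, not necessarily the approach from my post, and it assumes a terminal that supports the OSC 11 escape sequence (e.g. iTerm2 or kitty).

```python
# Hypothetical sketch: give each git repo a stable terminal background
# colour. Assumes OSC 11 support in the terminal; the hashing and colour
# choices here are illustrative.
import hashlib
import subprocess

def colour_for(path: str) -> str:
    """Map a repo path to a stable, dark hex colour so text stays readable."""
    digest = hashlib.sha256(path.encode()).digest()
    r, g, b = (c % 64 for c in digest[:3])  # keep each channel in 0-63
    return f"#{r:02x}{g:02x}{b:02x}"

def set_background() -> None:
    """Emit OSC 11 for the current git repo, if we are inside one."""
    try:
        top = subprocess.check_output(
            ["git", "rev-parse", "--show-toplevel"],
            stderr=subprocess.DEVNULL,
            text=True,
        ).strip()
    except subprocess.CalledProcessError:
        return  # not in a git repo; leave the background alone
    print(f"\033]11;{colour_for(top)}\007", end="")
```

Hooked into your shell prompt (e.g. via `PROMPT_COMMAND` in bash), this would recolour the window whenever you change directory into a different repo.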
Being punny 😅
Someone ripped off the front and back sections of my dictionary. It just goes from bad to worse.
Thanks! If you’d like to support my work…
Thanks for reading this week's newsletter — always appreciated!
If you’d like to support my work consider buying my book, Driving Data Quality with Data Contracts, or if you have it already please leave a review on Amazon.
🆕 I’ll be running my in-person workshop, Implementing a Data Mesh with Data Contracts, in June in Belgium. It will be the only workshop this year. Do join us!
Enjoy your weekend.
Andrew