Data contracts are a simple concept

Simple concept, but powerful in practice

Hello 👋

In this week’s newsletter I write about the simple concept of data contracts, and their power.

There are also links to articles on the 7-table fallacy and measurement engineering.


Data contracts are a simple concept

In a virtual book signing earlier this week at ODSC East, I was asked:

For teams new to the concept, how would you explain data contracts in simple terms?

My answer:

A data contract is simply a human- and machine-readable document that describes the data.

That’s it!

They really are a simple concept, based on the idea that with a bit more context, we can do so much more.

For example, just a simple data contract with a schema allows us to create and manage tables, as shown below:

name: Customer
description: A customer of our e-commerce website.
version: 1
fields:
  id:
    type: string
    description: The unique identifier for the customer.
    required: true
  name:
    type: string
    description: The name of the customer.
    required: true
  email:
    type: string
    description: The email address of the customer.
    required: true
  language:
    type: string
    description: The language preference of the customer.

All we need to do is convert that to something an infrastructure-as-code tool can understand, and we now have a table under change management, driven by the data contract.

A diagram showing the process flow from data contracts, with a note on change management, passing through context to IaC (Infrastructure as Code), and finally to data warehouse storage.
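
As a minimal sketch of that conversion, here is one way to turn the schema into a table definition. It assumes the contract has already been loaded into a Python dict (a YAML parser would do that in practice), and it uses a plain SQL `CREATE TABLE` statement as a stand-in for whatever your infrastructure-as-code tool expects. The `SQL_TYPES` mapping and the `contract_to_ddl` function are hypothetical names for illustration, not part of any tool.

```python
# Hypothetical in-memory form of the Customer contract; in practice this
# would be loaded from the YAML document with a YAML parser.
CONTRACT = {
    "name": "Customer",
    "description": "A customer of our e-commerce website.",
    "version": 1,
    "fields": {
        "id": {"type": "string", "required": True},
        "name": {"type": "string", "required": True},
        "email": {"type": "string", "required": True},
        "language": {"type": "string"},
    },
}

# Assumed mapping from contract types to SQL column types.
SQL_TYPES = {"string": "VARCHAR", "integer": "BIGINT", "boolean": "BOOLEAN"}


def contract_to_ddl(contract: dict) -> str:
    """Render a CREATE TABLE statement from a data contract."""
    columns = []
    for field_name, spec in contract["fields"].items():
        column = f"  {field_name} {SQL_TYPES[spec['type']]}"
        # Required fields become NOT NULL columns.
        if spec.get("required", False):
            column += " NOT NULL"
        columns.append(column)
    table = contract["name"].lower()
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"


print(contract_to_ddl(CONTRACT))
```

Because the statement is generated from the contract, any change to the table starts as a reviewed change to the contract.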

Add a bit more context to the data contract, such as SLOs and/or data quality rules, and we can implement observability:

name: Customer
description: A customer of our e-commerce website.
owner: [email protected]
version: 1
slos:
  timeliness: 1hr
fields:
  id:
    type: string
    description: The unique identifier for the customer.
    required: true
  name:
    type: string
    description: The name of the customer.
    required: true
  email:
    type: string
    description: The email address of the customer.
    pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
    required: true
  language:
    type: string
    description: The language preference of the customer.
    enum: [en, fr, es]

Again, the implementation is simple: convert the data contract into something that Great Expectations, Soda, etc. can understand.

A hand-drawn diagram illustrating a data observability process with a data contracts document providing context to an observability service that sends alerts to users and performs data quality checks on a data warehouse.
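
To illustrate the idea without tying it to a particular tool, the sketch below derives checks directly from the contract and runs them against rows in plain Python. It is a stand-in for the rules you would generate for Great Expectations or Soda, not their API. The `check_row` function is a hypothetical name, and the sketch covers only the field-level rules (`required`, `pattern`, `enum`), not the timeliness SLO.

```python
import re

# Hypothetical in-memory form of the contract's field rules.
CONTRACT = {
    "name": "Customer",
    "fields": {
        "id": {"type": "string", "required": True},
        "name": {"type": "string", "required": True},
        "email": {
            "type": "string",
            "required": True,
            "pattern": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
        },
        "language": {"type": "string", "enum": ["en", "fr", "es"]},
    },
}


def check_row(contract: dict, row: dict) -> list[str]:
    """Return a list of human-readable rule violations for one row."""
    failures = []
    for field_name, spec in contract["fields"].items():
        value = row.get(field_name)
        if value is None:
            if spec.get("required", False):
                failures.append(f"{field_name}: required value is missing")
            continue  # Optional and absent: nothing more to check.
        if "pattern" in spec and not re.fullmatch(spec["pattern"], value):
            failures.append(f"{field_name}: does not match pattern")
        if "enum" in spec and value not in spec["enum"]:
            failures.append(f"{field_name}: {value!r} not in {spec['enum']}")
    return failures


rows = [
    {"id": "1", "name": "Ada", "email": "ada@example.com", "language": "en"},
    {"id": "2", "name": "Bob", "email": "not-an-email", "language": "de"},
]
for row in rows:
    print(row["id"], check_row(CONTRACT, row))
```

The first row passes every rule and the second fails the `pattern` and `enum` checks, which is the kind of result an observability service would turn into an alert.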

Add a bit more context, such as an anonymisation strategy, and you can create an anonymisation service that ensures data is anonymised to restrict access and/or once it has exceeded its retention period.

name: Customer
description: A customer of our e-commerce website.
version: 1
fields:
  id:
    type: string
    description: The unique identifier for the customer.
    required: true
  name:
    type: string
    description: The name of the customer.
    required: true
    anonymisation_strategy: hex
  email:
    type: string
    description: The email address of the customer.
    required: true
    anonymisation_strategy: email
  language:
    type: string
    description: The language preference of the customer.

You then use this contract in a small tool that anonymises the data using the features of your data warehouse.

A diagram showing the process of data anonymisation: data contracts produce context, which is processed by an anonymisation service to produce anonymised data stored in a data warehouse.
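
A minimal sketch of such a tool is below. It works on in-memory rows rather than using warehouse features, purely to keep the example self-contained. The contract does not define what the `hex` and `email` strategies do, so the meanings here are assumptions: `hex` replaces the value with a truncated SHA-256 hex digest, and `email` replaces it with a digest at `example.com`. `STRATEGIES` and `anonymise_row` are hypothetical names.

```python
import hashlib

# Hypothetical in-memory form of the contract's anonymisation settings.
CONTRACT = {
    "name": "Customer",
    "fields": {
        "id": {"type": "string"},
        "name": {"type": "string", "anonymisation_strategy": "hex"},
        "email": {"type": "string", "anonymisation_strategy": "email"},
        "language": {"type": "string"},
    },
}


def _digest(value: str) -> str:
    """Truncated SHA-256 hex digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]


# Assumed meanings for the strategy names used in the contract.
STRATEGIES = {
    "hex": _digest,
    "email": lambda value: f"{_digest(value)}@example.com",
}


def anonymise_row(contract: dict, row: dict) -> dict:
    """Return a copy of the row with each strategy applied to its field."""
    anonymised = dict(row)
    for field_name, spec in contract["fields"].items():
        strategy = spec.get("anonymisation_strategy")
        if strategy and anonymised.get(field_name) is not None:
            anonymised[field_name] = STRATEGIES[strategy](anonymised[field_name])
    return anonymised


row = {"id": "1", "name": "Ada", "email": "ada@example.com", "language": "en"}
print(anonymise_row(CONTRACT, row))
```

Fields without a strategy, such as `id` and `language`, pass through unchanged. A plain unsalted hash is used only to keep the sketch short; a real service would choose its technique to match its privacy requirements.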

The data contract remains simple, both as a concept and as a document, and yet the potential to use it to automate the difficult parts of data creation and management is limitless.

That’s the power of the data contract.


The 7-Table Fallacy: Why Text-to-SQL Isn’t Enterprise AI by Timothy W. Cook

A really good read arguing that it isn’t enough to have good column names and trust the LLM to work it out, and that semantic layers aren’t the answer either. We need context stored with the data.

Measurement Engineering: The Part of Data Science That Will Thrive in AI by Eric Weber

It’s not just showing the data, it’s understanding what the numbers mean and what they can and cannot support.

It’s probably not a new job title though - it’s just what we should be doing.


Being punny 😅

I’ve just won the ‘World’s most secretive person’ award. I can’t tell you how much it means to me.


Thanks! If you’d like to support my work…

Thanks for reading this week’s newsletter, always appreciated!

If you’d like to support my work, consider buying my book, Driving Data Quality with Data Contracts, or if you have it already, please leave a review on Amazon.

🆕 I’ll be running my in-person workshop, Implementing a Data Mesh with Data Contracts, in June in Belgium. It will likely be my only in-person workshop this year. Do join us!

Enjoy your weekend.

Andrew




    Andrew Jones
    Author
    I build data platforms that reduce risk and drive revenue.