
Data contracts are simple, but powerful


Data contracts are a simple idea: you describe your data in a structured format.

For example, here is a YAML file I use in my training.

dataset: customers
owner: [email protected]
description: All active customers of our product.
version: 1

columns:
  - name: id
    description: Unique ID for each customer
    data_type: VARCHAR
    checks:
      - type: no_missing_values
      - type: no_duplicate_values
  - name: size
    description: The customer's t-shirt size
    data_type: VARCHAR
    checks:
      - type: invalid_count
        valid_values: ['S', 'M', 'L']
        must_be_less_than: 1
  - name: created
    description: The timestamp at which the customer object was created
    data_type: TIMESTAMP
  - name: distance
    description: The distance the customer is from our shop
    data_type: INTEGER

But this simple description unlocks a lot of power. It allows us to:

  • Create a stable interface for the data, much like an API
  • Automate the running and reporting of data quality checks
  • Improve communication between those who create data and those who consume it

And more.
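To make the second point concrete, here is a minimal sketch of how the checks in the contract above could be run automatically. It is an illustration under some assumptions, not how any particular tool does it: the contract is written inline as a Python dict (in practice you would parse the YAML file), the `run_check` and `validate` functions and the sample rows are hypothetical, and the meaning I give each check type (for example, counting a missing value as invalid) is my own reading of the names.

```python
# Assumption: the contract is inlined as a dict to keep the sketch self-contained.
# In practice it would be parsed from the YAML file shown above.
CONTRACT = {
    "dataset": "customers",
    "columns": [
        {
            "name": "id",
            "data_type": "VARCHAR",
            "checks": [
                {"type": "no_missing_values"},
                {"type": "no_duplicate_values"},
            ],
        },
        {
            "name": "size",
            "data_type": "VARCHAR",
            "checks": [
                {
                    "type": "invalid_count",
                    "valid_values": ["S", "M", "L"],
                    "must_be_less_than": 1,
                }
            ],
        },
        {"name": "created", "data_type": "TIMESTAMP"},
        {"name": "distance", "data_type": "INTEGER"},
    ],
}


def run_check(check, values):
    """Return True if one column's values pass a single check (hypothetical semantics)."""
    kind = check["type"]
    if kind == "no_missing_values":
        return all(v is not None for v in values)
    if kind == "no_duplicate_values":
        present = [v for v in values if v is not None]
        return len(present) == len(set(present))
    if kind == "invalid_count":
        # Assumption: anything outside valid_values, including None, is invalid.
        invalid = sum(1 for v in values if v not in check["valid_values"])
        return invalid < check["must_be_less_than"]
    raise ValueError(f"Unknown check type: {kind}")


def validate(contract, rows):
    """Run every check in the contract and return (column, check, passed) tuples."""
    results = []
    for column in contract["columns"]:
        values = [row.get(column["name"]) for row in rows]
        for check in column.get("checks", []):
            results.append((column["name"], check["type"], run_check(check, values)))
    return results


# Made-up sample rows, purely for illustration.
ROWS = [
    {"id": "c1", "size": "M", "created": "2024-01-01T10:00:00", "distance": 12},
    {"id": "c2", "size": "XL", "created": "2024-01-02T11:30:00", "distance": 3},
    {"id": "c2", "size": "S", "created": "2024-01-03T09:15:00", "distance": 40},
]

for column, check, passed in validate(CONTRACT, ROWS):
    print(f"{column}: {check}: {'PASS' if passed else 'FAIL'}")
```

With these sample rows the `id` column passes the missing-values check but fails the duplicates check (`c2` appears twice), and `size` fails because `XL` is not in the list of valid values. The runner knows nothing about customers. Everything it checks comes from the contract, so adding a check to the YAML is all it takes to start enforcing it.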

Data contracts are simple, but powerful.