A contract-driven data catalog

I wrote yesterday about how important a data catalog becomes when you move to a decentralised and autonomous data architecture built on data contracts.

But how should this data catalog be populated?

Many data catalogs connect to data sources, such as your data warehouse, and scan those sources to collect the metadata, which would include:

  • The schema
  • Usage statistics
  • Relationships (lineage)
  • Any documentation in that data source, such as column descriptions

Some of those we already have in the data contract, in particular the schema and the documentation, and the data contract is the source of truth for those. Scanning them from the warehouse is still ok, as we know the data contract will keep the table updated as the contract evolves, so the table remains a reasonable source for the data catalog.

Most data catalogs also allow you to enrich this metadata with additional information, such as:

  • Data classifications
  • Annotations
  • Tags

And so on, directly in the data catalog.

But now we have some metadata defined in the data contract, and some defined in the data catalog. Sooner or later (usually sooner) they are going to fall out of sync.

The data contract is much more likely to be kept updated, as the data producer must change it when they change the structure of their data, otherwise the change won’t take effect.

Data contracts are also located near the code that publishes this data, so any change to that code also changes the data contract - either directly or as part of the same PR, with CI checks ensuring that happens. That’s much less friction than remembering to visit a separate web interface to make updates.
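
As a rough illustration, a CI check like that could be as small as the sketch below. It assumes the CI job can hand us the list of files changed in the PR; the directory layout, file names and the function `missing_contract_update` are all hypothetical, not taken from any particular tool.

```python
from pathlib import PurePosixPath

# Hypothetical repository layout, used only for illustration.
SCHEMA_CODE_DIR = PurePosixPath("orders_publisher/models")
CONTRACT_PATH = PurePosixPath("orders_publisher/contract.yaml")


def missing_contract_update(changed_files):
    """Return True when schema-defining code changed but the contract did not."""
    paths = [PurePosixPath(f) for f in changed_files]
    # A file counts as schema-defining code if it lives under SCHEMA_CODE_DIR.
    code_changed = any(SCHEMA_CODE_DIR in p.parents for p in paths)
    contract_changed = CONTRACT_PATH in paths
    return code_changed and not contract_changed


# Example: the PR touches the model code but not the contract,
# so the check would fail the build.
if missing_contract_update(["orders_publisher/models/order.py"]):
    print("Schema code changed without a data contract update")
```

A real check would probably be smarter about which code changes actually affect the structure of the data, but the idea is the same: the update happens in the PR, not in a separate web interface.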

So, it makes sense for all the metadata to be defined in the data contract.

Which means the data catalog should be sourcing this information from the data contract. Not from the data warehouse, and not from its own web interface.
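
To make that concrete, here is a minimal sketch of building a catalog entry from nothing but the contract. It assumes the contract has already been parsed into a dictionary (for example from YAML). The contract layout, the field names such as `classification` and `tags`, and the function `catalog_entry_from_contract` are all made up for this example and don’t follow any specific data contract standard or catalog API.

```python
# A hypothetical data contract, already parsed into a dict.
contract = {
    "name": "orders",
    "owner": "checkout-team",
    "description": "One row per customer order.",
    "tags": ["sales"],
    "fields": [
        {
            "name": "order_id",
            "type": "string",
            "description": "Unique order identifier.",
            "classification": "internal",
        },
        {
            "name": "email",
            "type": "string",
            "description": "Customer email address.",
            "classification": "pii",
        },
    ],
}


def catalog_entry_from_contract(contract):
    """Build a catalog entry using only metadata found in the contract."""
    return {
        "dataset": contract["name"],
        "owner": contract["owner"],
        "description": contract.get("description", ""),
        "tags": sorted(contract.get("tags", [])),
        "columns": [
            {
                "name": field["name"],
                "type": field["type"],
                "description": field.get("description", ""),
                # Fields without a classification are flagged, not guessed.
                "classification": field.get("classification", "unclassified"),
            }
            for field in contract["fields"]
        ],
    }


entry = catalog_entry_from_contract(contract)
print(entry["columns"][1]["classification"])
```

The schema, documentation, classifications and tags all come from one place, so there is no second copy in the catalog to drift out of sync.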

It should be a contract-driven data catalog.


    Andrew Jones
    Author
    I build data platforms that reduce risk and drive revenue.