Metadata management at Notion

A couple of months ago the team at Notion published a great post titled “A brief history of Notion’s data catalog”.

While the focus is on the data catalog, it’s really a lesson in metadata management.

And while they don’t call it a data contract, the solution is very much like a first step towards implementing data contracts.

The problems they had

The problems Notion were trying to solve include:

  • Inconsistencies in their data environment
  • Absence of structure making it difficult to classify data events
  • Unclear ownership and responsibilities leading to governance and quality issues
  • Difficulty discovering data

These are common problems in many organisations that are managing complex and evolving data environments—particularly if you don’t address them until later.

Finding a solution

Notion’s initial solution was to implement a data catalog and index the data warehouse, but they found that the catalog wasn’t being used. They identified three reasons:

  1. Their data was often unstructured.
  2. Their data wasn’t documented, particularly the metadata in the data warehouse. Even where there was documentation, it fell out of sync as the business logic evolved.
  3. The documentation wasn’t always propagated to the downstream tables.

So, they decided to take a different approach to increase user engagement and use of the data catalog.

This approach prioritised quality: for the data, by adding more structure, and for the metadata, by capturing it at the source, where it is populated by the engineers who create the data.

Capturing the metadata

Notion selected an Interface Definition Language (IDL) to capture their metadata, i.e. the data contract. This could have been Avro, Protobuf, etc., but they made what they admit is an unconventional choice: TypeScript.

This is a great choice. One of the primary reasons Notion chose TypeScript is the same reason we chose to implement data contracts in Jsonnet at GoCardless: engineer familiarity. As they write:

Engineer familiarity: Most Notion engineers are familiar with TypeScript, which helps us maintain high development velocity by eliminating the learning curve associated with adopting a new schema language.

If you want your engineers to populate and maintain this metadata, you have to provide them with tools they are happy to use and that have a low barrier to adoption.

For GoCardless it was Jsonnet, for Notion it was TypeScript, for you it might be something else.

The team at Notion reiterate this later on in the post in their lessons learned:

Carry a user-first mindset: In the upstream process to create descriptions, we plugged into existing engineering workflows, such as using TypeScript over Protobuf
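
To make the idea of capturing metadata in TypeScript more concrete, here is a minimal sketch of what a contract-like definition could look like. This is not Notion’s actual schema: the `DataContract` and `FieldSpec` types, the `page_viewed` event, the `team-growth` owner and the `undocumentedFields` helper are all hypothetical names invented for illustration.

```typescript
// Hypothetical sketch only: none of these names come from Notion's post.

type FieldType = "string" | "number" | "boolean" | "timestamp";

// Metadata captured for each field, written by the engineer who owns the data.
interface FieldSpec {
  type: FieldType;
  description: string;
  required: boolean;
}

// The contract: structure plus ownership and documentation in one place.
interface DataContract {
  name: string;
  owner: string;
  description: string;
  fields: Record<string, FieldSpec>;
}

const pageViewed: DataContract = {
  name: "page_viewed",
  owner: "team-growth",
  description: "Emitted when a user opens a page.",
  fields: {
    user_id: {
      type: "string",
      description: "Identifier of the user who opened the page.",
      required: true,
    },
    viewed_at: {
      type: "timestamp",
      description: "Time at which the page was opened.",
      required: true,
    },
  },
};

// Returns the names of fields whose description is empty, so a CI check
// could reject contracts that are missing documentation.
function undocumentedFields(contract: DataContract): string[] {
  return Object.keys(contract.fields).filter(
    (fieldName) => contract.fields[fieldName].description.trim() === ""
  );
}
```

Because the definition is ordinary TypeScript, the compiler already rejects a contract with a missing owner or an unknown field type, and a check like `undocumentedFields` could run in the same workflow engineers already use.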

Populating the data catalog

The data catalog at Notion is now populated from this metadata.

To do that, they transform this TypeScript into JSON. As I’ve written before, it’s common to create many representations of your data contract to integrate with other systems.

The data catalog then syncs the documentation to other systems in their data stack, including their data warehouse and their BI tools.
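
As a rough illustration of creating several representations from one definition, the sketch below renders a single contract both as JSON (for a catalog to ingest) and as SQL comment statements (for a warehouse). It is an assumption about how this could be done, not a description of Notion’s pipeline: the `TableContract` type, the `orders` table and both helper functions are hypothetical, and the `COMMENT ON COLUMN` syntax is assumed to be supported by the target warehouse.

```typescript
// Hypothetical sketch only: not Notion's actual pipeline.

interface ColumnDoc {
  description: string;
}

interface TableContract {
  table: string;
  owner: string;
  columns: Record<string, ColumnDoc>;
}

const ordersContract: TableContract = {
  table: "orders",
  owner: "team-payments",
  columns: {
    order_id: { description: "Unique identifier for the order." },
    customer_id: { description: "The customer's identifier." },
  },
};

// Representation 1: JSON that a data catalog could ingest.
function toCatalogJson(contract: TableContract): string {
  return JSON.stringify(contract, null, 2);
}

// Representation 2: one SQL statement per column, so the same descriptions
// reach the warehouse. Single quotes are doubled to keep the SQL valid.
function toColumnComments(contract: TableContract): string[] {
  return Object.keys(contract.columns).map((column) => {
    const text = contract.columns[column].description.replace(/'/g, "''");
    return `COMMENT ON COLUMN ${contract.table}.${column} IS '${text}';`;
  });
}

console.log(toCatalogJson(ordersContract));
console.log(toColumnComments(ordersContract).join("\n"));
```

Generating every representation from the same source is what keeps the documentation from drifting out of sync between systems.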

The importance of quality metadata

This is a great example showing how important quality metadata is if you want to make your organisation’s data easily discoverable and usable.

As they say:

Creating high-quality descriptions starts with gathering a comprehensive set of metadata, including the tables’ content (“what” and “how”) and context (“why”)

Again, though they don’t call this a data contract, this aligns perfectly with the ideas around data contracts.

Check out the full post on the Notion blog to read more details, including how they use AI to generate documentation of downstream datasets.


Relationships over Requirements by Johnny Winter

To have a greater impact and deliver more value:

  • Focus on collaboration
  • Build meaningful connections
  • Align on shared business goals

Building LLMs is probably not going to be a brilliant business by Cal Paterson

Interesting article on the business of building LLMs.


Being punny 😅

A Very Happy Christmas to you all!

I was legally required to say this as the T&Cs for this newsletter have a Santa clause.



Thanks! If you’d like to support my work…

Thanks for reading this week’s newsletter - always appreciated!

If you’d like to support my work, consider buying my book, Driving Data Quality with Data Contracts, or, if you have it already, please leave a review on Amazon.

This is the final newsletter of the year. If you’re having time off, enjoy it! See you in the new year.

Andrew



Andrew Jones
Author
I build data platforms that reduce risk and drive revenue.