
Federated computational governance with data contracts

Hey, hope you’ve had a good start to June! I’m writing to you from Antwerp, Belgium, where I’m attending the inaugural Data Mesh Live conference. Looking forward to a great day of learning and meeting people before my talk later.

Today’s newsletter is the final part of my series on implementing data mesh, looking at how to federate and automate data governance with data contracts. See also:

There are also links to articles on the future of Kafka, two opposing takes on Iceberg, a new tutorial for data contracts, and a travel-related pun.

Plus, a fantastically detailed review of my book, which for this week only is 35% OFF both print and ebook!


Federated computational governance with data contracts

Data mesh describes how to decentralise data. However, one of the concerns with a decentralised model is how to maintain data governance.

That’s why data mesh has this principle of federated computational governance, which describes how to ensure that data is managed effectively while enabling data producers to operate with greater autonomy.

The word governance is well understood, but it’s the other two words that are important:

  • Federated, because with data mesh we’re allowing different parts of the business to create and maintain data products autonomously, while still conforming to the standardisation and governance that has been defined centrally
  • Computational, because with data mesh we automate the implementation of data governance, providing tools that implement these centrally defined rules

However, for this automation to be possible, these tools need to have enough context about the data to apply the governance rules to the data.

That context is provided by the data contract.

Let’s take data anonymisation and retention as an example.

We add a categorisation to the data contract, defining whether a field contains personal data and, if so, how it should be anonymised once that data has passed its retention period or is subject to a deletion request.

[Image: a data contract defined in YAML, with data quality checks, and an arrow showing those checks applied to a data warehouse through automated tools and services.]
This is added by the data producer, as they are the best person to define it: it’s their data.
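To make that a little more concrete, here’s a rough sketch of what that categorisation could look like. The field names and classification values are made up for this example rather than taken from any particular standard, and I’m parsing it with PyYAML just to show that tooling can read it programmatically.

```python
# Illustrative only: field names and classification values are made up,
# not taken from any particular data contract standard.
import yaml  # pip install pyyaml

contract_yaml = """
dataset: orders
retention_days: 365
fields:
  - name: customer_email
    type: string
    personal_data: true      # marked by the data producer
    anonymisation: hash      # what to do on retention expiry or a deletion request
  - name: order_total
    type: decimal
    personal_data: false
"""

contract = yaml.safe_load(contract_yaml)
pii_fields = [f["name"] for f in contract["fields"] if f.get("personal_data")]
print(pii_fields)  # ['customer_email']
```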

But they don’t need to know the different triggers that would cause a record to be anonymised (they don’t need to be GDPR experts!).

They also don’t need to implement their own tooling to anonymise the data.

The data platform does this, via data contracts.

The tools provided by the data platform codify the centrally defined governance rules. They then run those rules over all the data described by a data contract.

These tools don’t care about what the data is or why this field needs to be anonymised. They just need enough context to discover the data and run the rules.
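As a minimal sketch of what such a tool might do, assuming the illustrative contract shape above, plus an assumed created_at column and a hash function that will differ by warehouse:

```python
from datetime import date, timedelta

# A minimal sketch of a platform-side anonymisation job. It only needs the
# contract's classifications to generate the SQL; it knows nothing about what
# the data means. The created_at column and the hash expression are
# assumptions and would differ by warehouse.
ANONYMISERS = {
    "hash": "TO_HEX(SHA256(CAST({col} AS STRING)))",
    "null": "NULL",
}

def anonymisation_sql(contract: dict) -> str:
    cutoff = date.today() - timedelta(days=contract["retention_days"])
    set_clause = ", ".join(
        f'{f["name"]} = {ANONYMISERS[f["anonymisation"]].format(col=f["name"])}'
        for f in contract["fields"]
        if f.get("personal_data")
    )
    return (
        f'UPDATE {contract["dataset"]} SET {set_clause} '
        f"WHERE created_at < DATE '{cutoff.isoformat()}'"
    )

# Same shape as the illustrative contract above:
example = {
    "dataset": "orders",
    "retention_days": 365,
    "fields": [{"name": "customer_email", "personal_data": True, "anonymisation": "hash"}],
}
print(anonymisation_sql(example))
```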

Other examples include:

  • Use similar classifications to automate access controls (sketched after this list)
  • Automate the data lifecycle management (creation, deletion, backups, etc)
  • Populate a central data and processing catalog from all data contracts
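As a sketch of the first of those, the same classifications could drive access grants. The role names and column-level GRANT syntax here are assumptions and will vary by warehouse:

```python
# Another illustrative sketch: deriving access grants from the same field
# classifications. Role names and column-level GRANT syntax are assumptions
# and vary by warehouse.
def access_grants(contract: dict) -> list[str]:
    pii = [f["name"] for f in contract["fields"] if f.get("personal_data")]
    other = [f["name"] for f in contract["fields"] if not f.get("personal_data")]
    grants = []
    if other:
        grants.append(
            f'GRANT SELECT ({", ".join(other)}) ON {contract["dataset"]} TO ROLE analyst'
        )
    if pii:
        # Personal-data columns are only readable by a privileged role.
        grants.append(
            f'GRANT SELECT ({", ".join(pii)}) ON {contract["dataset"]} TO ROLE pii_reader'
        )
    return grants
```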

As I wrote last week, in the ~5 years we’ve been doing data contracts we’ve implemented all our data platform capabilities through them, and that includes the implementation of our data governance policies.

Data contracts are the only way I’m aware of to successfully implement federated computational governance.


35% OFF my book, this week only!

For this week only my book Driving Data Quality with Data Contracts is 35% off when you buy directly from packt.com using the code DATA35.

If you’re still on the fence, check out this fantastically detailed review of my book, written by Winfried Adalbert Etzel.

Or this AI-generated summary of Amazon reviews:

Customers appreciate the book's content, with one describing it as a comprehensive guide. The book provides an excellent introduction to data contracts, with one customer highlighting its strategies for enterprise-wide adoption.

Kafka: The End of the Beginning by Chris Riccomini

Interesting post looking at whether it’s time for an evolution of streaming, i.e. something other than Kafka. I have two thoughts.

The first is that there are already cloud-native solutions that meet the needs of many, for example Google Cloud Pub/Sub, AWS Kinesis, etc.

Of course, there are use cases that do need the power of Kafka, and my second thought is: what is missing for those users that highlights the need for something better or newer than Kafka? I’d guess it’s reducing the total cost of ownership, which is often a problem, but with Confluent running Kafka for you on any cloud, and Google running it for you on theirs, the cost of ownership has been coming down.

I do agree, though, that stream processing is still harder than it ought to be, even if it is gradually getting easier.

The Open Table Format Revolution: Why Hyperscalers Are Betting on Managed Iceberg by Simon Späti

An in-depth overview of Iceberg and the kind of stack it can enable.

But…

Apache Iceberg Isn’t Coming To Save You by Benjamin Rogojan

Which is also how I feel about Iceberg. Nice tech, but unlikely to solve problems big enough to justify migrating to it.

Data Contract & Product Experimentation by Jean-Georges Perrin

Jean-Georges Perrin (JGP) has started running hands-on tutorials to demystify data contracts and data products, using the Open Data Contract Standard from Bitol. You can register your interest with this link.


Being punny 😅

I hate hotel bath towels. So thick and fluffy, I can’t even close my suitcase.


That’s all for this week. Till next time!

Andrew