Moving less data, with federated queries

I mentioned yesterday that one way to reduce the number of data transformations (and their cost) is to challenge the assumption that we need to bring all data to a central place before it can be useful.

One of the arguments for bringing all the data centrally is that there is, or could be, great value gained by joining that data with your other datasets.

And I think that’s true. But moving data around - which is very costly! - is just one way to do that. Another is to federate your queries so they run directly against the data source.

MotherDuck published a post yesterday that does a good job of describing what federation is and why it’s worth considering. To summarise briefly, it’s a way of querying one database (such as DuckDB) and having it retrieve the data from another database (such as MySQL) and serve it to you.
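
To make that concrete, here is a toy sketch in Python. It is not DuckDB or Trino: it uses SQLite's ATTACH DATABASE from the standard library as a stand-in, because it runs without any servers. The table names, the `remote` alias and the sample rows are all made up for illustration. One connection joins its own table against a table that lives in a separate database file, and nothing is copied across first.

```python
import os
import sqlite3
import tempfile


def build_source(path):
    """Create a separate database file that plays the role of the remote source."""
    src = sqlite3.connect(path)
    src.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    src.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 20.0), (1, 5.0), (2, 12.5)],  # hypothetical sample data
    )
    src.commit()
    src.close()


def federated_totals(source_path):
    """Join a local table with a table stored in another database, in one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],  # hypothetical sample data
    )
    conn.commit()  # ATTACH is not allowed inside an open transaction

    # Make the other database visible under the alias `remote`.
    # The orders rows stay in their own file; nothing is loaded up front.
    conn.execute("ATTACH DATABASE ? AS remote", (source_path,))

    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM customers AS c
        JOIN remote.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source_path = os.path.join(tmp, "orders.db")
        build_source(source_path)
        print(federated_totals(source_path))  # [('Ada', 25.0), ('Grace', 12.5)]
```

A real federated engine does the same thing across different database systems and over the network, which is where the performance questions come from. The shape of the query is the same, though: you write one statement, and the engine works out where each table lives.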

And while support for federated queries in DuckDB today is quite limited, Trino has great support and can federate queries against many data sources. These include databases such as BigQuery and MySQL but also, interestingly, any OpenAPI endpoint, with services like Jira and GitHub already tested.

I think this is where the industry is heading. It may not (yet?) work for all use cases (performance can sometimes be an issue), but any time we can avoid moving data around, and all the costs and complexity that come with that, the better.