
If your data is worth storing, it's worth structuring

When some people talk about a Data Lake (or Hadoop, or even just Big Data), they go on to say that we can store all our data, unstructured, forever, and be able to analyse it at any time (maybe even in real-time!).

I’ve yet to hear about that dream coming true.

You often end up not even using the data. The cost of transforming it becomes prohibitive, because you need to handle all the variations of the data, and all the edge cases within it, before you can do anything useful with it.

Let’s take the classic example of unstructured data: storing text documents. When it comes to processing them, wouldn’t it be great to have some metadata that’s easily accessible and known to be there? Maybe a timestamp, a source, an ID? Or maybe a flag telling me whether a document contains personal data, together with a retention policy?

If represented as JSON, it might look something like this [1]:

{
  "uuid": "234",
  "timestamp": 12421,
  "source": "customer_upload",
  "customer_id": 12,
  "contents": "It was a bright day in April, and the clocks were striking thirteen."
}

That’s better, but we’ve only improved one part of our Data Lake. What if we knew we could rely on all the data having at least some standard fields that are always going to be useful? An Envelope, if you like, that wraps every event in our Data Lake. Again, as JSON it might look a bit like this:

{
  "uuid": "234",
  "timestamp": 12421,
  "event_type": "customer_upload",
  "fields": {
    "customer_id": 12,
    "contents": "It was a bright day in April, and the clocks were striking thirteen."
  }
}

Another example event might look like this:

{
  "uuid": "234",
  "timestamp": 12421,
  "event_type": "app_instumentation",
  "fields": {
    "app_id": 7,
    "event": "query.completed",
    "duration": 0.752
  }
}

The fields will be specific to each event_type, but should be the same for every event of that type (well, not necessarily identical, but compatible [2]).
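
As a rough illustration of what “compatible” means here (not how Avro or Thrift would enforce it, just the idea): suppose app_instrumentation later gains an optional status field. Readers that supply a default for the new field can handle both old and new events:

v1_fields = {"app_id": 7, "event": "query.completed", "duration": 0.752}
v2_fields = {"app_id": 7, "event": "query.completed", "duration": 0.752, "status": "ok"}

def read_instrumentation(fields):
    # The newer optional field gets a default, so events written
    # before it existed still parse cleanly.
    return {
        "app_id": fields["app_id"],
        "event": fields["event"],
        "duration": fields["duration"],
        "status": fields.get("status", "unknown"),
    }

print(read_instrumentation(v1_fields)["status"])  # unknown
print(read_instrumentation(v2_fields)["status"])  # ok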

Now any user of the Data Lake can parse any event in the same way, ideally using the same code. We have some basic metadata to start exploring the data, and the fields for a particular event type should be well documented and understood before being used for further analysis or transformation.
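
As a sketch of what that shared parsing code could look like (plain Python over JSON purely for illustration; the names mirror the examples above):

import json

def parse_envelope(raw):
    """Split a raw JSON event into its shared envelope and type-specific fields."""
    event = json.loads(raw)
    envelope = {
        "uuid": event["uuid"],
        "timestamp": event["timestamp"],
        "event_type": event["event_type"],
    }
    return envelope, event["fields"]

raw_event = '{"uuid": "234", "timestamp": 12421, "event_type": "app_instrumentation", "fields": {"app_id": 7, "event": "query.completed", "duration": 0.752}}'

envelope, fields = parse_envelope(raw_event)
if envelope["event_type"] == "app_instrumentation":
    # The envelope is read the same way for every event; only the fields vary by type.
    print(envelope["uuid"], fields["duration"])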

All of this doesn’t mean we can’t take advantage of applying a different schema on read. How we store the data in raw form does not have to be the same as how we present it to the end user, who likely won’t be using the Data Lake directly, but rather a Data Warehouse or some specialised data store.
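
For instance, a downstream job could leave the raw envelope untouched and only apply a flat, per-event-type shape when reading it. A minimal sketch, with the flattened layout purely as an assumption:

def flatten(event, event_type):
    """Project a raw envelope event into a flat row for one event_type."""
    if event["event_type"] != event_type:
        return None
    row = {"uuid": event["uuid"], "timestamp": event["timestamp"]}
    row.update(event["fields"])  # lift the type-specific fields to the top level
    return row

raw_events = [
    {"uuid": "234", "timestamp": 12421, "event_type": "customer_upload",
     "fields": {"customer_id": 12, "contents": "It was a bright day in April..."}},
    {"uuid": "235", "timestamp": 12422, "event_type": "app_instrumentation",
     "fields": {"app_id": 7, "event": "query.completed", "duration": 0.752}},
]

# The raw storage is untouched; this is just one possible "read" view over it.
uploads = [row for row in (flatten(e, "customer_upload") for e in raw_events) if row]
print(uploads)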

When building your Data Platform (or adding to your existing one), think carefully about how you want the raw data to be saved in your Data Lake. Don’t just chuck your data in, unstructured, and worry about it later. If your data is worth storing, it’s worth structuring.

Cover image from Unsplash.


  1. Don’t get hung up on JSON; it’s just used here for illustration. You would probably use something like Avro in production.

  2. If you’re using something like Avro or Thrift, ensure your schema can evolve by following their compatibility guidelines.