We had an incident, and it was great
We recently had an incident with our data pipeline, resulting in data being lost en route to our data platform. Of course, you never want an incident, but failures are a fact of life. What’s important is how you prepare for them and respond to them, and in that sense this was a great incident.
Background #
At GoCardless we’ve spent the last 3 years building out a best-in-class data platform to drive our analytics, data science, and data-driven products such as Success+. The first 2 years were primarily about getting the primitives in place and bringing our data together so we could unlock its value, as discussed in this example written up by one of our Principal Engineers.
In the last year we’ve really been focussing on increasing the robustness and reliability of the data platform. It’s not that it was unreliable before, but as more critical processes start to depend on it and it drives products our customers are paying for, we made a conscious decision to invest in taking it to the next level of maturity.
Building confidence in our data #
The first question a user will ask when using a data platform is: can I trust the data? We need not only to be able to answer that question, but to prove it.
Since the start we’ve had some validations against the data we ingest, but these only checked recent records against the source data. We wanted to go further and prove the data is 100% correct, across all time.
That’s obviously quite a challenge, and how we did it is a topic for another post. But having that validation in place means we know not only that something is broken, but exactly which records are affected. Following an incident, we can then provide a list of those records to our users so they can ensure any downstream processes are corrected.
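To give a flavour of the “which records are affected” part, here is a rough, hypothetical sketch of a reconciliation between a source system and BigQuery. This is not a description of our actual implementation, and the table and column names are made up:

```python
# Illustrative only: find which records are missing downstream by comparing
# primary keys. Table and column names are hypothetical.
from google.cloud import bigquery


def missing_record_ids(source_ids: set, dataset: str, table: str) -> set:
    """Return primary keys present in the source but absent from BigQuery."""
    client = bigquery.Client()
    rows = client.query(f"SELECT id FROM `{dataset}.{table}`").result()
    bq_ids = {row["id"] for row in rows}
    return source_ids - bq_ids
```

The resulting set is exactly the kind of list that can be handed to downstream users after an incident.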
Planning for failure #
As already mentioned, failure is inevitable. Whether that’s caused by bugs in our services, outages in our cloud provider, or something completely unforeseen, we have to accept failure will happen.
We’ve spent some time thinking about what might fail and how we would recover. That led to a number of initiatives, such as improving our backups, ensuring our Pub/Sub subscriptions were set up for replaying messages, and documenting exactly how we could recover in the event of a failure.
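To give a sense of what “set up for replaying messages” involves: Pub/Sub can only seek a subscription back over messages it has retained, so acked-message retention needs to be enabled on the subscription. A minimal sketch using the Python client, with hypothetical project and subscription names:

```python
# Minimal sketch: retain acknowledged messages so the subscription can later
# be seeked back to a timestamp and replayed. Names are hypothetical.
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2, field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()

subscription = pubsub_v1.types.Subscription(
    name=subscriber.subscription_path("my-project", "my-subscription"),
    retain_acked_messages=True,
    # Pub/Sub retains messages for up to 7 days.
    message_retention_duration=duration_pb2.Duration(seconds=7 * 24 * 60 * 60),
)
update_mask = field_mask_pb2.FieldMask(
    paths=["retain_acked_messages", "message_retention_duration"]
)

subscriber.update_subscription(
    request={"subscription": subscription, "update_mask": update_mask}
)
```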
We also wrote an incident response protocol, giving us a step-by-step guide to follow when we have an incident. This sounds like a small thing, but it ensures everyone knows their roles and responsibilities, and that we communicate effectively both within our team and with the affected users outside it.
The incident #
The incident itself was caused by the ingestion of a date greater than 9999-12-31, the largest date BigQuery supports. We implemented an easy fix in the pipeline to cap dates. However, we’re using a streaming Dataflow pipeline, so when the BigQuery load job fails it retries indefinitely and the pipeline stalls. We attempted to drain the pipeline, expecting all but the problem records to be written, and then terminated it. However, because the pipeline was partially stalled, a large number of records were lost when we terminated it.
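As an illustration of the cap (a sketch only; the field names are made up and the real pipeline is more involved), a Beam DoFn could clamp any out-of-range date before the write to BigQuery:

```python
# Sketch of capping dates that exceed BigQuery's maximum DATE value.
# Field names are hypothetical; this is not our actual pipeline code.
import apache_beam as beam

BQ_MAX_DATE = "9999-12-31"  # largest DATE value BigQuery can store


class CapDates(beam.DoFn):
    def __init__(self, date_fields):
        self.date_fields = date_fields

    def process(self, record):
        record = dict(record)
        for field in self.date_fields:
            value = record.get(field)
            # Assume dates arrive as ISO "YYYY-MM-DD" strings; a year beyond
            # 9999 cannot be represented as a BigQuery DATE.
            if value and int(value.split("-", 1)[0]) > 9999:
                record[field] = BQ_MAX_DATE
        yield record


# Usage inside a pipeline, e.g.:
#   records | beam.ParDo(CapDates(["charge_date"]))
```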
Our daily validations failed the next morning and reported a large number of records missing in BigQuery. When we looked into those failures we quickly found that a number of tables were affected, and the timestamps matched the problem with the pipeline. To repair the data we simply replayed the events from Pub/Sub, seeking back to a timestamp before the issue. We had never done that to resolve an incident before, but we had expected that one day we would need to and had created a runbook to follow. Within a couple of hours the data was corrected and the incident resolved.
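For illustration, the replay itself amounts to a single seek on the subscription, assuming retention was enabled as sketched earlier. Again, the project and subscription names here are hypothetical:

```python
# Sketch of replaying retained Pub/Sub messages by seeking the subscription
# back to a point in time before the incident. Names are hypothetical.
import datetime

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Pick a timestamp safely before the pipeline started losing records;
# everything published after it will be redelivered. 12 hours is illustrative.
replay_from = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(hours=12)

subscriber.seek(request={"subscription": subscription_path, "time": replay_from})
```

One thing worth noting with this approach is that records which were written successfully before the failure get redelivered too, so the downstream write needs to tolerate duplicates.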
Throughout the incident we followed our incident response protocol, assigning someone to lead the resolution and a different person to handle comms. The team had regular catch ups on Zoom and communication continued in the dedicated Slack channel, which was open to all. The best thing about having a defined protocol to follow was that it kept everyone calm, both inside the team and out. We felt prepared and in control.
We were only able to recover so quickly and so confidently because of the work we had done over the last year. If this had happened 12 months ago, it would have been a very different story…
Of course, not everything went perfectly, and we have some learnings and actions following the postmortem, but this felt like a massive moment on our path to building a best-in-class data platform.
All in all, it was a great incident.
Fancy working with us to build a best-in-class data platform? My team is hiring. Feel free to get in touch!
Cover image from Unsplash.