Great post on the Box tech blog on how they deal with system outages. It starts with how they used to handle them and then covers what they have done to improve things, which has now led to less severe outages and improved system availability.
I especially like the idea of scoring each outage with PIE (Probability of recurrence; Impact of recurrence; Ease of addressing). The ranking lets you prioritise investigation and prevention work against other outages and the rest of your workload.
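To get a feel for how a PIE ranking might work in practice, here is a minimal sketch in Python. The 1 to 5 rating scale, the idea of multiplying the three ratings into a single score, and the example outages are all my own assumptions for illustration; the Box post may well combine the factors differently.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    name: str
    probability: int  # assumed scale: 1 (unlikely to recur) to 5 (almost certain)
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    ease: int         # assumed scale: 1 (hard to address) to 5 (trivial to address)

    def pie_score(self) -> int:
        # Assumption: multiply the three ratings so that a likely,
        # painful, easy-to-fix outage floats to the top of the list.
        return self.probability * self.impact * self.ease


def rank_outages(outages):
    """Return the outages ordered from highest to lowest PIE score."""
    return sorted(outages, key=lambda o: o.pie_score(), reverse=True)


# Made-up outages, purely for illustration.
outages = [
    Outage("disk full on report server", probability=4, impact=3, ease=5),
    Outage("nightly batch job timeout", probability=2, impact=2, ease=2),
    Outage("database failover stalled", probability=3, impact=5, ease=1),
]

for outage in rank_outages(outages):
    print(f"{outage.pie_score():3d}  {outage.name}")
```

Multiplying rather than adding means a very low rating on any one factor drags the whole score down, which may or may not be what you want for a severe but hard-to-fix problem.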
The systems I work on are all internal and nowhere near the scale of Box, but we suffer from the same problem: after an outage, you just forget about it until it happens again. I’m going to think about how I can apply the lessons learned by Box to our work.
Another thing we have started looking at is having each service report its health, and having a system perform health checks. The results would be stored so you could analyse them over time, and also displayed on a service page as traffic lights. We got this idea from Airbnb, who mentioned their approach in their post on service discovery [1]:
In order to know whether a particular backend can be registered, Nerve performs health checks. Every service that you want to register has a list of health checks, and if any of them fail the backend is de-registered.
Although a health check can be as simple as “is this app reachable over TCP or HTTP”, properly integrating with Nerve means implementing and exposing a health check via your app. For instance, at Airbnb, every app that speaks HTTP exposes a /health endpoint, which returns a 200 OK if the app is healthy and a different status code otherwise. This is true across our infrastructure; for instance, try visiting https://www.airbnb.com/health in your browser!
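To make the idea concrete, here is a rough sketch in Python of both halves: a service exposing a /health endpoint in the style Airbnb describes, and a checker that polls it and stores a traffic-light result. This is not Nerve and not what we run; the HealthHandler, check_health and record names, the 503 status for an unhealthy app, and the green/amber/red mapping are all assumptions of mine.

```python
import threading
import urllib.error
import urllib.request
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical flag; a real app would test its own dependencies here.
    healthy = True

    def do_GET(self):
        if self.path == "/health":
            # 200 when healthy, a different status code (503 here) otherwise.
            self.send_response(200 if self.healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"OK" if self.healthy else b"UNHEALTHY")
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Ignore any proxy settings, since internal services are called directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_health(url, timeout=2.0):
    """Map a health check to a traffic light (assumed mapping)."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return "green" if response.status == 200 else "amber"
    except urllib.error.HTTPError:
        return "amber"  # the service answered, but with an error status
    except (urllib.error.URLError, OSError):
        return "red"    # the service could not be reached at all


def record(history, name, url):
    """Run one check and append (timestamp, service, colour) to history."""
    colour = check_health(url)
    history.append((datetime.now(timezone.utc), name, colour))
    return colour


def demo():
    HealthHandler.healthy = True
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/health"

    history = []
    record(history, "demo-service", url)   # service is up and healthy
    HealthHandler.healthy = False
    record(history, "demo-service", url)   # service is up but unhealthy
    server.shutdown()
    server.server_close()
    record(history, "demo-service", url)   # service is gone
    return [colour for _, _, colour in history]


if __name__ == "__main__":
    print(demo())
```

In a real setup the history would go into a database rather than a list, so the service page can show the current light and you can still analyse the results over time.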