Incident response is a common topic of conversation in today’s DevOps organisations, especially as companies have come to understand the reputational and revenue importance of keeping their applications up and running without fail.
In the DevOps community, we commonly talk about organisational designs and processes for effective incident response – like the Incident Command model, as discussed by our friends from Fastly in this talk. We also frequently discuss technologies and tools for incident response, spanning monitoring, alerting, logging, mitigation and many other focus areas. At NS1, we make use of an Incident Command model and have invested in the usual array of technologies for incident response, along with developing many of our own.
Despite the time organisations spend thinking about incident response processes and the investments they make in related technologies, it’s still possible to fail to respond effectively when a real incident arises. In our position at NS1, as a provider of mission-critical DNS and traffic management services for many of the largest internet properties, there’s little room for failure. We’ve learned that augmenting our processes and technologies with three simple practices makes a huge difference in real-world incident response situations.
My personal favourite practice for effective incident response is to use good checklists. A checklist is a simple, concise, easy-to-understand tool to help an operator recall key steps to take in a particular situation. In an intense incident response scenario, it can be easy for an operator to skip or forget steps in their response. Rather than being overly prescriptive, a checklist should help an experienced professional, who is familiar with the systems they’re operating, make the best use of their knowledge in a high-pressure situation.
One of my favourite examples of a great checklist isn’t one used in DevOps or networked systems operations – it’s the Cessna 172S emergency procedure for engine failure during flight.
This checklist is discussed in depth in a great book, The Checklist Manifesto, by Atul Gawande. This example demonstrates some of the key principles behind a good checklist: it’s simple and short, not overly specific or prescriptive; it’s limited to only the most important actions the operator needs to recall during an incident mitigation; and because of its simplicity and clarity, it helps the operator calmly execute those key actions.
There’s another great lesson in this Cessna checklist for operators of today’s IT infrastructure. As Gawande describes in his book, “Because pilots sometimes become so desperate trying to restart their engine, so crushed by the cognitive overload of thinking through what could have gone wrong, they forget this most basic task: FLY THE AIRPLANE.” In IT operations, it’s no different – we can sometimes become so distracted by root cause analysis and the desire to know why an incident has occurred that we forget to focus on mitigation first – analysis can come later, once the incident is resolved.
We can have the best organisational models, great tools and technologies, and even complete and well-vetted checklists, but how do we get great at using them all in real-world situations? Practice makes perfect.
Fire drills are an important aspect of any incident response strategy because they ensure we regularly exercise our processes, tools and checklists in simulated scenarios that approximate real-world incident response situations. There are a number of ways to create fire drills with respect to networked systems. One tool we often discuss in the technical operations community is Chaos Monkey, built by Netflix, a software embodiment of the emerging discipline of Chaos Engineering. Chaos Engineering is a methodology for improving the resilience of a distributed system by creating failure situations – for example, by randomly “breaking” virtual servers in a distributed application to force failure handling systems to engage. This is a kind of “automated” fire drill that drives resilient architecture.
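To make the idea of an automated fire drill concrete, here is a minimal Python sketch of a break-then-heal loop. It is not Netflix's Chaos Monkey and does not touch real infrastructure: the `Server`, `Cluster`, `break_random_server` and `run_drill` names are hypothetical, and the "failure handling system" is assumed to be a naive routine that swaps every failed server for a fresh one.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    """A hypothetical virtual server in a toy cluster."""
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """A toy pool of servers with a naive self-healing routine (an assumption)."""
    servers: List[Server] = field(default_factory=list)
    replacements: int = 0

    def healthy_count(self) -> int:
        return sum(1 for s in self.servers if s.healthy)

    def heal(self) -> None:
        """Failure handling: replace every unhealthy server with a fresh one."""
        for i, server in enumerate(self.servers):
            if not server.healthy:
                self.replacements += 1
                self.servers[i] = Server(name=f"{server.name}-r{self.replacements}")


def break_random_server(cluster: Cluster, rng: random.Random) -> Server:
    """Chaos step: mark one randomly chosen healthy server as failed."""
    victim = rng.choice([s for s in cluster.servers if s.healthy])
    victim.healthy = False
    return victim


def run_drill(cluster: Cluster, rounds: int, seed: int = 0) -> List[str]:
    """Run `rounds` break-then-heal cycles and return the names of the victims."""
    rng = random.Random(seed)  # seeded so a drill can be replayed
    victims = []
    for _ in range(rounds):
        victims.append(break_random_server(cluster, rng).name)
        cluster.heal()
    return victims


if __name__ == "__main__":
    cluster = Cluster([Server(f"web-{n}") for n in range(4)])
    victims = run_drill(cluster, rounds=3)
    print("broken during drill:", victims)
    print("healthy after drill:", cluster.healthy_count(), "of", len(cluster.servers))
```

In a real drill, the interesting part is whatever replaces `heal()`: if the failure handling does not engage, the healthy count drops, and that is exactly the gap the drill is meant to expose.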
Another kind of fire drill that we use at NS1 is a “war game” to simulate DDoS attack situations. We build tools similar to those used by sophisticated attackers and regularly exercise our DDoS mitigation response by literally attacking production systems. In this type of fire drill, we split our team into attackers and defenders (similar to the red/blue team approaches used by information security organisations for penetration testing). We’ve learned that running good DDoS war games requires us to attack real systems with real attack techniques, which helps us understand the dynamics of our distributed systems and builds muscle memory in our team around key steps needed to respond to real-world incidents. We’ve also found that fire drills like these are great tools for team building; they’re almost like running football or soccer scrimmages at the end of a long practice session, helping everyone on the team understand how they can best contribute and what “position” they should play in a real-world event.
A final practice for driving effective incident response is postmortem or retrospective analysis. After any incident, real or simulated, a thorough review of the incident and the response is critical. The key is to ask, and carefully answer, some simple questions about the response to the incident: What happened and why? What worked well in our response? How could we have responded more effectively?
The goal of postmortem analysis is to create a feedback loop and apply learning from the incident, so that in the future, your team’s response to incidents in general, and certainly to similar incidents, is more effective. Obviously, it’s critical to fix any issues in your systems, incident response processes and mitigation tools that surface during an incident response. You should also review relevant checklists after an incident: what was missed, or what was confusing? And finally, feed the outputs of real or simulated incidents into your fire drills so you can practise similar scenarios in the future.
Incidents happen to every technology company delivering content or applications in networked systems. While such events stress teams to their limits, some simple best practices help ensure that incidents are handled effectively and that a team’s incident response improves over time.
Kris Beevers, CEO and co-founder, NS1
Image Credit: Profit_Image / Shutterstock