The first step in effective incident response is to define an incident

Everyone would agree that it’s important to have a solid incident response plan in place, but such a plan is possible only when everyone agrees on the definition of an “incident”. Without a commonly understood definition, response will never be as effective as it could (and should) be.

In many ways, whether something qualifies as an incident is subjective. What is a non-issue for one company may, for another, amount to a significant loss of revenue or even company reputation. The same can be true within a company: for example, the CRM system going down may barely register with some departments yet affect the sales team’s ability to seize revenue opportunities.

At the end of the day, however, a problem in one department—or for one employee—is a problem for the company. It’s for this reason that organisations must come to agreement on what defines an incident, who should deal with it and how.

Putting incidents in context

Generally speaking, an incident is an unplanned disruption or degradation of service that adversely affects internal and/or external customers. A routine incident can usually be addressed by the team responsible for the affected system. A major incident, meanwhile, requires collaboration between multiple teams and business units such as ITOps, DevOps, customer support, sales, security or marketing.

Incident response refers to the process by which organisations deal with issues that have exceeded a specified parameter of normal operations. The main goal of incident response is to limit the negative impact caused by an incident and to reduce the time and money it takes to resolve the matter. Equally important, incident response involves retroactively analysing the event to ensure that the issue that caused the problem doesn’t occur again in the future.

There are many steps involved in setting up an incident response system and team, not to mention establishing a culture focused on collaboration to mitigate the overall impact, but one of the most important steps is determining what defines an incident at your organisation.

Indeed, during any given time frame, unexpected things are bound to happen. This is especially true given the pace of the digital workplace, where failing—and recovering—fast are all a part of smart business.

It can therefore be challenging to distinguish between day-to-day operational maintenance issues and customer-impacting incidents, let alone to determine what constitutes a “major” incident.

Classifying incidents

It is important to have a scale that can be used to measure the severity of an incident—a set of pre-defined guidelines for determining whether an incident is minor, major or somewhere in between. ITOps, DevOps and development teams use these measurements to guide the actions they take to address the problem.

One way incidents can be classified is by severity. This is usually done with "SEV" definitions, in which lower-numbered severities are more urgent. Here is a sample severity scale ranging from 1 to 5:

  • Level 1 Severity: A critical issue that warrants public notification and collaboration with executive teams
  • Level 2 Severity: A critical system issue actively impacting many customers’ ability to access services
  • Level 3 Severity: Stability or minor customer-impacting issues that require immediate attention from service owners
  • Level 4 Severity: Minor issues requiring action, but not affecting customer ability to use the product
  • Level 5 Severity: The lowest-priority issues, which may require only the submission of a help ticket

When developing severity levels for your own organisation, be specific and metrics-driven—for example, referring to percentage of users and accounts affected.
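
As a rough illustration of what "metrics-driven" could look like, the sketch below maps the percentage of users affected to a level on the 1-to-5 scale. The function name, the `needs_action` flag and every cut-off value are hypothetical assumptions chosen for the example; they are not taken from any real organisation's policy and would need tuning to your own services.

```python
def classify_severity(pct_users_affected: float, needs_action: bool = True) -> int:
    """Return a SEV level from 1 (most urgent) to 5 (least urgent).

    All thresholds below are illustrative assumptions, not recommendations.
    """
    if not 0.0 <= pct_users_affected <= 100.0:
        raise ValueError("pct_users_affected must be between 0 and 100")

    # Hypothetical cut-offs: the larger the share of users hit, the lower
    # (more urgent) the SEV number.
    if pct_users_affected >= 50.0:
        return 1
    if pct_users_affected >= 10.0:
        return 2
    if pct_users_affected > 0.0:
        return 3

    # No customer impact: SEV 4 if action is still required, otherwise SEV 5.
    return 4 if needs_action else 5
```

A real scheme would probably weigh more than one metric, such as accounts affected or revenue at risk, but even a single agreed threshold gives teams a shared starting point.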

Responding to incidents

Using the scale above, any Level 1 or Level 2 incident would be considered major.

Severity levels can help companies quickly, and more objectively, put an incident in context so they can then determine how to deal with it.

Generally speaking, the more severe an incident, the more extreme the response. For example, a SEV 5 incident may require only the submission of a help ticket. A SEV 1 incident, on the other hand, might warrant notification to all internal stakeholders, as well as to customers and even the general public.
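
The link between severity and response can be sketched as a simple lookup table. In the example below, only the SEV 1 and SEV 5 responses and the "Level 1 or 2 is major" rule come from this article; the actions listed for SEV 2 to SEV 4, and all of the names used, are hypothetical placeholders.

```python
# SEV 1 and SEV 2 count as major incidents on the sample scale.
MAJOR_THRESHOLD = 2

RESPONSE_PLAN = {
    1: [
        "notify internal stakeholders",
        "notify customers",
        "notify the general public",
    ],
    # The entries for SEV 2-4 are assumed placeholders for illustration.
    2: ["page on-call teams", "notify internal stakeholders"],
    3: ["page the service owner"],
    4: ["schedule a fix with the owning team"],
    5: ["submit a help ticket"],
}


def is_major(sev: int) -> bool:
    """Return True when the SEV level counts as a major incident."""
    if sev not in RESPONSE_PLAN:
        raise ValueError(f"unknown SEV level: {sev}")
    return sev <= MAJOR_THRESHOLD


def response_for(sev: int) -> list[str]:
    """Return a copy of the planned response steps for a SEV level."""
    if sev not in RESPONSE_PLAN:
        raise ValueError(f"unknown SEV level: {sev}")
    return list(RESPONSE_PLAN[sev])
```

Writing the plan down as data, rather than leaving it to judgement in the moment, makes it easier for every team to review and agree on it in advance.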

Determining severity levels, and appropriate response at each of the levels, is key to effectively defining “incident” for your organisation. But it’s also important to note that few things in business—or in life—are well defined. Things will happen that don’t fit neatly into your SEV levels. If you think it’s an incident, chances are it is indeed an incident. What’s important is to figure out whether it’s major; if so, it’s better to err on the side of caution and rank it higher rather than lower on the SEV scale. After all, it’s easier to scale back response than it is to scale up.

Preventing incidents

While it’s true that incidents happen, and that the way in which an organisation responds provides a measure of its resiliency and agility, the ultimate goal is to prevent an incident in the first place.

Organisations whose key stakeholders have collaborated to effectively define and classify incidents will likely find themselves responding to incidents less and less frequently over time.

For one thing, the discussions that go on across IT, dev and business to facilitate incident definition can shine a brighter spotlight on critical systems and the protections and optimisations that need to be in place to ensure availability and high performance.

Such discussions can also shine a new spotlight on systems whose criticality might not have been fully understood. For example, social media platforms might have been low on the priority list for IT and DevOps until marketing justified making the loss of social metrics systems a SEV 2 incident.

Incidents that do occur can also help companies prevent future incidents. Done right—with structure and positive intentions—an incident post-mortem enables organisations to improve future response, mitigate customer impact over time, and implement products, policies and people that will prevent incidents from occurring in the first place.

In a perfect world, an organisation’s incident response system would never be put into action. In the real world, incident response dictates how, and how well, an organisation keeps up with the rapid pace of change and complexity in today’s business environment. Developing a common understanding of just what constitutes an incident is the first step toward ensuring that companies not only keep but exceed that pace.

Steve Barrett, Vice President of EMEA, PagerDuty