How to deal with a major IT incident: Eight top tips

Each time a major incident, such as a payroll crash, happens, the IT team gets into a fire-fighting mode and takes the resolution process to a whole new level.

This doesn't have to be the norm. If you follow these eight best practices, you can resolve a major incident quickly and without panic.

Clearly Define a Major Incident

When an issue causes a huge business impact on several users, you can categorise it as a major incident. It is one that forces an organisation to deviate from existing incident management processes.

High-priority incidents are often mistaken for major incidents, probably because of the absence of clear ITIL guidelines. To avoid any confusion, define a major incident clearly based on factors such as urgency, impact, and severity.
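As a rough illustration of such a definition, the sketch below treats an incident as major only when urgency, impact, and severity are all at their highest level and many users are affected. The 1-3 scales, the `MAJOR_USER_THRESHOLD` value, and all names are assumptions made for this example, not ITIL rules; substitute your organisation's own criteria.

```python
from dataclasses import dataclass

# Hypothetical threshold: how many affected users count as "several".
MAJOR_USER_THRESHOLD = 50


@dataclass(frozen=True)
class Incident:
    title: str
    urgency: int         # assumed scale: 1 = low, 3 = high
    impact: int          # assumed scale: 1 = low, 3 = high
    severity: int        # assumed scale: 1 = low, 3 = high
    users_affected: int


def is_major_incident(incident: Incident) -> bool:
    """Return True only when every factor is at its highest level
    and the number of affected users reaches the threshold."""
    all_high = (
        incident.urgency == 3
        and incident.impact == 3
        and incident.severity == 3
    )
    return all_high and incident.users_affected >= MAJOR_USER_THRESHOLD


payroll = Incident("Payroll crash", urgency=3, impact=3, severity=3, users_affected=400)
printer = Incident("Executive's printer offline", urgency=3, impact=1, severity=1, users_affected=1)

print(is_major_incident(payroll))  # high on every factor, many users
print(is_major_incident(printer))  # urgent, but low impact and severity
```

The printer ticket shows the distinction the definition is meant to capture: it may be high priority, but it does not qualify as a major incident under these assumed criteria.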

Have Exclusive Workflows

Implementing a robust workflow helps you restore a disrupted service quickly, and a separate workflow for major incidents helps in seamless resolution. Adopt a no-approval process for resolving major incidents. When you formulate the workflow, focus on automating and simplifying the following:

  • Identifying the major incident
  • Communicating to the impacted stakeholders
  • Assigning the right people
  • Tracking the major incident throughout its life cycle
  • Escalation upon breach of SLAs
  • Resolution and closure
  • Generation and analysis of reports

Reel in the Right Resources

Ensure that your best resources are working on major incidents. Also, clearly define their roles and responsibilities because of the high impact these incidents have on business. You could have a dedicated or a temporary team depending on how often major incidents occur.

Some organisations have a dedicated major incident team headed by a major incident manager, whereas others have a dynamic, ad hoc team that has experts from various departments. Your primary objective must be to keep your resources engaged and avoid conflict of time and priorities.

Configure Stringent SLAs and Hierarchical Escalations

Define stringent SLAs for major incidents. Set up separate response and resolution SLAs with clear escalation points for any breach of the process.

In addition, follow a manual escalation process if the assigned technician lacks the expertise to resolve the incident. Moreover, ensure that a backup technician is always available.
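The following sketch shows one way separate response and resolution SLAs with hierarchical escalation points could be checked. The time limits, the escalation chain, and the rule that each additional breach moves one level up the chain are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical SLA targets for major incidents.
RESPONSE_SLA = timedelta(minutes=15)
RESOLUTION_SLA = timedelta(hours=4)

# Hypothetical escalation points, lowest level first.
ESCALATION_CHAIN = ["major incident manager", "head of IT"]


def breached_slas(
    opened_at: datetime,
    responded_at: Optional[datetime],
    resolved_at: Optional[datetime],
    now: datetime,
) -> List[str]:
    """Return the names of the SLAs that have been breached so far."""
    breaches = []
    # An incident that has not been responded to keeps ageing until now.
    if (responded_at or now) - opened_at > RESPONSE_SLA:
        breaches.append("response")
    if (resolved_at or now) - opened_at > RESOLUTION_SLA:
        breaches.append("resolution")
    return breaches


def escalation_target(breaches: List[str]) -> Optional[str]:
    """Pick an escalation point: one breach goes to the first level,
    two breaches go to the second."""
    if not breaches:
        return None
    level = min(len(breaches), len(ESCALATION_CHAIN)) - 1
    return ESCALATION_CHAIN[level]


opened = datetime(2024, 1, 1, 9, 0)
breaches = breached_slas(
    opened_at=opened,
    responded_at=datetime(2024, 1, 1, 9, 10),  # responded within 15 minutes
    resolved_at=None,                          # still unresolved
    now=datetime(2024, 1, 1, 14, 0),           # five hours after opening
)
print(breaches, escalation_target(breaches))
```

In a real help desk tool these checks would normally be configured as SLA rules rather than written by hand; the sketch only makes the logic explicit.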

Keep Your Stakeholders Informed

Throughout the life cycle of major incidents, send announcements, notifications, and status updates to the stakeholders. Announcements in the self-service portal will prevent end users from raising duplicate tickets and overloading the help desk.

Send hourly or bi-hourly updates during a service downtime caused by a major incident. Have a dedicated line to respond to major incidents immediately and offer support to stakeholders. Use the fastest means of communication, such as telephone calls, direct walk-ins, live chat, and remote desktop control, instead of relying on email.

Tie Major Incidents with Other ITIL Processes

After a major incident is resolved, perform a root cause analysis by using problem management methods. Then, implement organisation-wide changes to prevent the occurrence of similar incidents in the future by following the change management process.

Speed up the entire incident, problem, and change management process by using asset management to provide detailed information about the assets involved.

Improve Your Knowledge Base

Formulate simple knowledge base article templates that capture critical details such as the type of major incident the article relates to, the latest issue resolved using the article, owner of the article, and the resources that would be needed to implement the solution.
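A template like this could be modelled as a small record type. The sketch below is one possible shape, with field names invented for this example; the sample values are placeholders, not real data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MajorIncidentArticle:
    """Hypothetical knowledge base template for major incident solutions."""
    title: str
    incident_type: str            # type of major incident the article relates to
    latest_issue_resolved: str    # most recent issue resolved using the article
    owner: str                    # person or team responsible for the article
    required_resources: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """List the critical details that are still empty."""
        missing = [
            name
            for name in ("incident_type", "latest_issue_resolved", "owner")
            if not getattr(self, name).strip()
        ]
        if not self.required_resources:
            missing.append("required_resources")
        return missing


article = MajorIncidentArticle(
    title="Restoring the payroll service",
    incident_type="Payroll crash",
    latest_issue_resolved="",     # placeholder left empty on purpose
    owner="Major incident team",
    required_resources=["Database administrator", "Payroll application owner"],
)
print(article.missing_fields())
```

Checking for empty fields before an article is published is one simple way to make sure the critical details are actually captured.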

Create and track solutions separately for major incidents so that you can access them quickly with very little effort.

Review and Report on Major Incidents

Document and analyse all major incidents so that you can identify areas of improvement. This will help your team efficiently handle similar issues in the future. Also, generate major incident-specific reports for analysis, evaluation, and decision making.

You could generate the following reports to help in efficient decision making:

  1. Number of major incidents raised and closed each month
  2. Average resolution time for major incidents
  3. Percentage of downtime caused by major incidents
  4. Problems and changes linked to major incidents
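The first three reports above can be sketched from a simple list of incident records. The record layout, the sample timestamps, and the function names below are made up for illustration; a real help desk would pull these figures from its ticketing data.

```python
from collections import Counter
from datetime import datetime, timedelta

# Illustrative records only; "closed" is None while an incident is still open.
incidents = [
    {"opened": datetime(2024, 1, 5, 9, 0), "closed": datetime(2024, 1, 5, 13, 0)},
    {"opened": datetime(2024, 1, 20, 22, 0), "closed": datetime(2024, 1, 21, 0, 0)},
    {"opened": datetime(2024, 2, 10, 8, 0), "closed": None},
]


def monthly_counts(records, field):
    """Count records per month for the given timestamp field."""
    return Counter(
        r[field].strftime("%Y-%m") for r in records if r[field] is not None
    )


def average_resolution(records):
    """Mean time from opening to closure, ignoring open incidents."""
    durations = [r["closed"] - r["opened"] for r in records if r["closed"] is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def downtime_share(major_downtime, total_downtime):
    """Percentage of total downtime that was caused by major incidents."""
    if total_downtime == timedelta(0):
        return 0.0
    return 100 * (major_downtime / total_downtime)


raised = monthly_counts(incidents, "opened")
closed = monthly_counts(incidents, "closed")
print(dict(raised), dict(closed))
print(average_resolution(incidents))
print(downtime_share(timedelta(hours=6), timedelta(hours=8)))
```

The fourth report, problems and changes linked to major incidents, depends on how your tool relates tickets to each other and is left out of this sketch.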

Major incidents are unavoidable and each one is a learning experience for your team. Adhering to these practices could be your first step towards mastering the art of handling major incidents.

Prithiv RajKumar, marketing analyst at ManageEngine
