As the business demand for IT services increases, so too will the frequency and criticality of technology incidents. These incidents happen to companies large and small, across every vertical, and their impact ranges from inconvenient to potentially catastrophic. Whether they involve cybersecurity attacks, network outages, application slowdowns, hardware failures, or disaster recovery, these incidents demand a quick response. And while many organisations take an organised approach to addressing and managing the aftermath of these IT incidents, there is an opportunity for them to significantly cut down on response time.
But the question remains: how do they achieve this? Most companies begin to answer that question by measuring Mean Time To Repair (MTTR), the time it takes to repair, resolve or otherwise restore whatever has gone wrong to its rightful pre-incident state. That time starts from the point of incident determination, whether through proactive monitoring solutions or complaints to the service desk, and encompasses the time it takes to respond, investigate the cause, fix the issue, and test or validate the resolution. In addition to measuring MTTR, here are some ways to better prepare your incident response team for IT incidents.
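As a rough illustration of the measurement described above, MTTR can be computed as the average span between the moment an incident is determined and the moment its resolution is validated. The sketch below is only one way to do this; the function name, the tuple layout and the sample timestamps are illustrative assumptions, not taken from any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (resolved - detected) over a list of incidents.

    Each incident is a (detected_at, resolved_at) pair of datetimes.
    This layout is a hypothetical one chosen for the example.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up sample data: one 90-minute incident and one 30-minute incident.
sample_incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),
]

mttr = mean_time_to_repair(sample_incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

Because the clock starts at incident determination, any time spent locating and assembling responders counts towards the figure, which is why the steps that follow focus on that phase.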
Ensure you can engage the “right” response team quickly
Having team members who are prepared to respond quickly to any IT incident can set an organisation up for success. Ensuring that teams are equipped with the proper training and technology, and have run test and practice incidents, can help them better prepare. Most war rooms average 15 people (a number that has been rising in recent years), and the makeup of those teams can shift depending on the nature, likely origin, and impact of each incident. Crossing functional boundaries, teams include talent from IT Ops, Infrastructure, Network, Customer Service, Technical Support, Dev, Security Ops, 3rd party providers, Senior Management, Legal, and line of business. It is not uncommon for these groups to have uneven response time policies across departments, which can create additional delays.
On-call schedule management is crucial to drive people’s accountability and make the management of a critical incident a science rather than guesswork. For mid-tier and large-sized companies, the process of identifying and assembling responders and initiating collaboration among them was the single most problematic of all processes. Members of the team need to be involved in moving from initial awareness through investigative diagnosis to fixing and testing. Thankfully, smart routing technology is available to identify and contact the best responders based on on-call scheduling and other criteria.
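To make the idea of schedule-based routing concrete, here is a minimal sketch of how a responder might be chosen for each required team from an on-call rota. It is not how any specific product works; the `Shift` record, the team names and the "first matching shift wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Shift:
    """One on-call shift (hypothetical record layout)."""
    responder: str
    team: str
    start: datetime
    end: datetime


def on_call_responders(shifts, teams_needed, at):
    """Map each needed team to the first responder on call at time `at`.

    Teams with nobody on call are left out of the result, so the caller
    can see which gaps need escalation.
    """
    chosen = {}
    for shift in shifts:
        if shift.team in teams_needed and shift.start <= at < shift.end:
            chosen.setdefault(shift.team, shift.responder)
    return chosen


# Made-up rota for illustration.
rota = [
    Shift("alice", "Network", datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 20)),
    Shift("bob", "Security Ops", datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 20)),
    Shift("carol", "Network", datetime(2024, 1, 1, 20), datetime(2024, 1, 2, 8)),
]

incident_time = datetime(2024, 1, 1, 21, 30)
print(on_call_responders(rota, {"Network", "Security Ops"}, incident_time))
```

In this sample, only the Network team has someone on call at the time of the incident, so the Security Ops gap is immediately visible. A real routing system would also weigh the "other criteria" mentioned above, such as skills or escalation policies.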
Recognize the benefits of cloud-based communications
Most organisations rely on internal email to communicate in the event of a crisis, although the nature of today’s cybersecurity landscape makes this a dangerous gamble. Email is likely to be the first service attacked. According to FireEye, 91 per cent of cybercrime starts with email. Organisations must be aware of their cybersecurity vulnerabilities, since communicating via email during an incident could exacerbate the issue and potentially provide hackers with critical company information.
Organisations should consider using a cloud-based analytical communications platform, which operates entirely independently of the internal communications network. In an emergency, organisations cannot waste time searching spreadsheets and schedules to notify employees manually. This technology automates the time-intensive emergency cascade process, ensuring that resources can be deployed far more effectively and efficiently than before and that the safety of everyone involved is better protected. In doing so, communications technologies can not only help protect business assets but also save the lives of employees.
Develop a system for constant collaboration
Collaboration is the key to a successful incident response, which makes it essential to get the right teams collaborating in real time right away. Smart orchestration technology automates the notification of the various groups and people involved (IT staff, key stakeholders, impacted business users), using each recipient’s preferred delivery methods and combinations of contact paths.
Proactive notification of impacted business users improves transparency and reduces frustration, strengthening IT/business relations. It also keeps the service desk from being swamped with redundant calls. Incident owners can offer one-click or PIN-less conference bridge access, as well as ChatOps channels like Slack or Teams. For known issues, remediation runbooks can be triggered and executed directly from the notification the IT staff receives.
Ensure visibility across the entire organisation
As more industries embrace digital transformation, the line between IT and business objectives can blur. This makes it critical that incident response performance be visible across all of IT. One way to achieve this is by using interactive dashboards that provide heat maps, which allow teams to understand where staff are located at all times. Additionally, smart analytics provide incident response performance trending by group, time, or type. In addition to supporting resource planning and SLA adherence, this visibility helps diverse groups see a common landscape across service, operations, security, DevOps and IT BC/DR.
There are several things to measure to see how your incident response performed. Start by looking at the time saved. This means looking at factors like the current average time required to identify, locate, notify, engage, and initiate collaboration among the people who have the skillsets and availability to work on a significant incident or outage. Fintech Futures research shows that the average time to assemble the right team is over an hour, with 30 per cent coming in under an hour and 30 per cent over 1.5 hours.
Next, examine the cost of downtime. Figures abound from all sorts of studies estimating the cost of downtime or outages per minute, or even per second in high-volume transactional businesses. The actual figure will vary wildly depending on the size of the organisation and the nature of the business. Most companies have some idea of their average costs and should use their own numbers. If one is not available, $5,000/minute would be a conservative stand-in figure for a midsized concern, according to Uptime Institute’s Eighth Annual Data Centre Industry Survey Report.
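Putting the two measurements together gives a back-of-the-envelope estimate of what faster team assembly is worth. The sketch below assumes that cost scales linearly with minutes of downtime and that every minute spent assembling the team is a minute of downtime; both are simplifications. The $5,000/minute default is the stand-in figure quoted above, while the 60-minute and 15-minute assembly times are purely illustrative.

```python
def downtime_cost(minutes, cost_per_minute=5_000):
    """Estimated cost of an outage, assuming a flat per-minute rate."""
    return minutes * cost_per_minute


def assembly_savings(current_minutes, improved_minutes, cost_per_minute=5_000):
    """Cost avoided per incident by assembling the response team faster."""
    saved_minutes = current_minutes - improved_minutes
    return downtime_cost(saved_minutes, cost_per_minute)


# Illustrative inputs: a one-hour assembly time cut to 15 minutes.
print(f"Cost of 60 minutes of downtime: ${downtime_cost(60):,}")
print(f"Savings per incident: ${assembly_savings(60, 15):,}")
```

Organisations that know their own per-minute cost should pass it in place of the default.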
IT incidents happen, and in a world of competing demands and priorities, it’s not always easy to cut through the clutter and make the right decision. By using IT Alerting and Incident Response Automation solutions across departments and bringing visibility into the situation, IT professionals will be equipped to deal with downtime swiftly and get businesses back online sooner.
Vincent Geffray, Senior Director of IT Alerting and IoT, Everbridge
Image source: Shutterstock/hafakot