
The changing face of IT operations management


During a time of constant change, many of Europe’s largest companies are still betting on legacy, command-and-control-based IT incident management practices. With customer expectations at an all-time high and tolerance at an all-time low, overdependence on traditional incident management systems puts organisations at a distinct competitive disadvantage. To stay in the game, organisations need to do nothing short of changing the face of IT operations management.

Machine learning curve

Consumers today expect services to be delivered in real time and will put up with a less-than-perfect online experience for only a few seconds. At the same time, infrastructure complexity is increasing. Cloud computing, distributed architectures and IT modernisation initiatives bring with them more technology, applications and signals than ever before. When incidents do occur, responders receive data from a multitude of sources. It is easy for them to become overwhelmed - particularly when attempting to triage a complex situation.

Forward-thinking organisations are using machine learning (ML) to help triage incidents and provide insights - drawing on data not only from their own organisation but potentially also from other organisations around the world. ML puts context directly into the hands of those closest to an incident, leading to swifter resolution and a better customer experience.

But soon, even real-time technology won’t feel fast enough. We will need to go one step further to predict what’s coming before it happens. This will mean looking for signals and patterns – just as a meteorologist might do. Patterns from past weather events can indicate, for example, that there's an 80 per cent chance of a category three hurricane becoming a category five by the time it hits land. Likewise, large sets of accurate data can provide context and highlight emerging patterns, revealing the degree of probability of a major IT incident occurring. With a little help from artificial intelligence (AI), prediction is within reach.
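To make the weather analogy concrete, here is a minimal sketch of how a probability might be read off historical data. It is an illustration under stated assumptions, not a description of any particular product: the alert pattern names, the sample history and the `incident_probability` helper are all hypothetical, and a real system would draw on far larger and richer data sets.

```python
from collections import Counter

# Hypothetical history: (alert pattern observed, whether a major incident followed).
HISTORY = [
    ("disk_latency_spike", True),
    ("disk_latency_spike", True),
    ("disk_latency_spike", False),
    ("disk_latency_spike", True),
    ("disk_latency_spike", True),
    ("login_error_burst", False),
    ("login_error_burst", True),
    ("login_error_burst", False),
    ("login_error_burst", False),
]


def incident_probability(history, pattern):
    """Return the share of past occurrences of `pattern` that were followed
    by a major incident, or None if the pattern has never been seen."""
    seen = Counter()
    escalated = Counter()
    for observed, became_major in history:
        seen[observed] += 1
        if became_major:
            escalated[observed] += 1
    if seen[pattern] == 0:
        return None  # no data, so no estimate
    return escalated[pattern] / seen[pattern]


if __name__ == "__main__":
    for name in ("disk_latency_spike", "login_error_burst", "unknown_pattern"):
        print(name, incident_probability(HISTORY, name))
```

In this made-up history, four of the five disk latency spikes were followed by a major incident, which corresponds to the kind of "80 per cent chance" statement a forecaster might make. The estimate is only as good as the volume and accuracy of the underlying data.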

However, for ML and AI to be successful, companies will need to find a sweet spot where they put human power and human thinking first, then add in just the right amount of ML and AI to make things more efficient. It’s about ML and AI becoming our allies, not our replacements.

Alert strategy updates

The boardroom expects IT to help mitigate the business impact of incidents on the bottom line. Yet, IT can’t address what it can’t ‘see’. Rather worryingly, a recent study commissioned by PagerDuty, Unplanned Work: The Human Impact of an Always-On World, suggests companies are more likely to learn about major IT issues from their customers than their own systems.

It’s critical to continually review alerting strategies to ensure they flag what customers - both internal and external - care about most. Availability and performance are obvious factors. Others will depend on the nature of the business, but, in general, as the business evolves, so must the alerting strategy.

Communication counts

IT incidents can take a significant amount of time to resolve. That’s an issue for technical responders, of course, but it’s also an issue for business stakeholders on the front lines - the people trying to help customers whose experience is being compromised while the IT problem is being fixed.

Think about an airline whose ticketing system has been compromised. While technical responders are scrambling to diagnose and fix the issue, customer service agents are dealing with increasingly frustrated customers. When agents can only respond, “I’m sorry, there’s a technical issue. I don’t know when the system will be back up,” the negative impacts of the incident will likely spiral as customers vent their frustrations on social media.

Forward-thinking IT organisations are ready to communicate. They think about the stakeholder groups that will need to be updated - typically, executives, customer support and sales - and ensure that they have the most up-to-date information on whom to contact within the groups and how to contact them.

They also know good service isn’t just about responding to questions. It’s about anticipating them. By being proactive, they decrease the number of support tickets opened. They automate what they can. They make it easier for internal stakeholders to get information without reaching out directly to the incident commander and distracting them from the job in hand.

Indeed, it’s easy to think of analytics as something to be applied retrospectively, as part of the all-important incident post-mortem. But today’s executives expect IT to help them understand the business impact of incidents as they happen. Fortunately, new tools enable IT teams to do so - curating information in real time on not only service and team health, but also the total cost of the incident and response. Business response is now an extension of incident response.

New success metrics

Many of today’s organisations are fixated on the reliability of their technology. That’s understandable. But as any developer will tell you, the question is not if it will fail, but when. We can expect to see success metrics shift to resilience, or how quickly companies can recover from failure. As industry analyst Forrester puts it, companies will need to “Design for dependability, not just availability.” Resilience engineering is likely to break onto the scene in a major way in 2020. We’ll start to see European organisations, especially the larger ones, combine the benefits of tool automation with resilience techniques based on collaborative resolution.

Automate where possible

The costs associated with IT incidents are high. When the volume of incidents and alerts goes unchecked, IT teams face burnout and lost productivity. This dramatically affects an organisation’s ability to innovate, which increases competitive and brand risk.

So, although the new challenges facing those responsible for IT operations management can seem insurmountable, there are ways around them. Our research shows a 20 per cent cut in unplanned work for companies that automate incident response. By automating what they can, and using evolving capabilities such as ML and AI to cut down on inefficiencies, companies are still able to focus their greatest (human) assets on areas where they can create and derive the greatest value.

Steve Barrett, VP, EMEA, PagerDuty