
Death by a thousand cuts: why it’s not just the big IT outages that we need to be aware of

(Image credit: Shutterstock/hafakot)

The consumerisation of IT and the ‘instant gratification’ economy are driving the need for consistently high performance from businesses. Remaining accessible to customers 24/7 is now mission-critical, which has fostered a culture where instant access and being ‘always connected’ are not just highly valued but expected. In this modern, tech-driven world, IT outages are unacceptable and can severely damage a business’s profitability and reputation when they occur. This was seen in the recent IT outage at O2, which left thousands of customers unable to use their mobile data for over 24 hours, and the larger outage at TSB, where a failed IT systems migration locked millions of customers out of their accounts for weeks. These events had huge knock-on effects; in TSB’s case, the outage cost the bank £200 million, and continued IT faults ultimately led to the resignation of CEO Paul Pester.

However, while large IT outages dominate the news and trigger instant outcries on social media, they are not the only threat to business growth and reputation. Smaller, potentially hidden failures also need to be treated as critical events because, by the end of the financial year, even the smallest IT performance issues mount up. For example, if a website slows down for an hour each day, customers visiting at that time can easily become frustrated with the experience and move to another supplier. Likewise, a specific internal IT failure might cost just £100 in staff time and lost opportunity each time it happens. In isolation that doesn’t sound like much and may be easy to absorb as a planned cost. But it escalates quickly: if the event happens 15 times per month, it costs the business £4,500 each quarter, or £18,000 per year. Suddenly, the cost doesn’t look so minuscule after all.
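The arithmetic behind that escalation can be sketched in a few lines. The per-incident cost and frequency below are the article’s illustrative figures, not real measurements:

```python
# Illustrative cost model using the article's example figures:
# a small internal IT failure costing £100 in staff time and lost
# opportunity, occurring 15 times per month.
cost_per_incident = 100      # £ per failure
incidents_per_month = 15

monthly = cost_per_incident * incidents_per_month
quarterly = monthly * 3
yearly = monthly * 12

print(f"Monthly:   £{monthly:,}")     # £1,500
print(f"Quarterly: £{quarterly:,}")   # £4,500
print(f"Yearly:    £{yearly:,}")      # £18,000
```

A seemingly absorbable £100 line item compounds into a five-figure annual cost once frequency is taken into account.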

To make things more complex, these smaller performance incidents are harder to identify because many businesses lack a unified view of the entire network. Behind the new breed of innovative customer- and employee-facing digital services lies a hotchpotch of disparate, decentralised systems: virtual machines, hybrid cloud accounts, IoT endpoints, physical and virtual networks and much more. These systems don’t talk to each other, and they frequently fail. Moreover, many of them sit outside the control of IT, adding an extra layer of opacity and complexity.

The importance of unified visibility

With consumer expectations so high, there is little room for error, and companies need full, centralised visibility across the entire network. The value of visibility lies in enabling teams to learn how specific activities affect performance, so they know where the bottlenecks are that could delay projects or cause outages. Yet gaining that insight is a persistent challenge, for several reasons. Sometimes the tools in use were designed to monitor the static, on-premises infrastructure of the past rather than today’s dynamic, cloud-based and virtualised systems. More commonly, organisations are using multiple tools, producing multiple versions of the truth for siloed IT teams.

In many ways, IT operations has been left behind by digital change, as many organisations still treat it as an afterthought: major investments in new apps and services are not matched by proactive improvements in performance monitoring. Part of this is down to perceptions of IT operations and monitoring as a cost centre rather than a value driver, but that is often because firms aren’t monitoring the right things. Focusing purely on availability rather than business service performance will not deliver strategic value. Furthermore, without visibility into the performance of applications and systems, early-stage problems can be missed and snowball into major incidents such as full IT outages. These incidents carry a huge financial impact: research from Gartner puts the average cost of IT downtime at about $5,600 per minute, while acknowledging that at the top end the figure can reach $540,000 per hour.

This lack of unified, proactive investment in IT operations is compounded by tool sprawl: recent research from analyst firm Enterprise Management Associates found that a vast number of organisations run more than ten different monitoring tools, meaning it can take between three and six hours to find the source of an IT performance issue. This is clearly unsustainable: it places unnecessary load on the IT environment and wastes budget on training, implementation and integration.

Breaking down industry silos

One of the most damaging consequences of tool sprawl is that, over the years, it has exacerbated the silo-isation of IT, with various teams relying on disparate views of monitoring and unable to find common ground. Related to this, and perhaps even more detrimental, it inflates Mean Time To Repair by creating too many data points, leaving IT teams overwhelmed by the sheer volume of applications running across multiple servers. Amid this complexity they struggle to understand which alerts really matter to the business, where small outages are taking place, or what to prioritise in the extreme event of a major outage. These age-old monitoring techniques and approaches are still widely used, yet they inherently drive inefficiency and increase risk.

Outages can also occur suddenly and without warning. In such cases, it is vital to detect the failure quickly and know which systems are affected. Once a failure is identified, organisations should have processes in place to mitigate it rapidly, reducing downtime and lost revenue. Monitoring can also support preventative action by spotting the patterns that often precede a failure, stemming the internal bleeding caused by small, recurring IT outages.
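One common way to spot such precursor patterns is to compare each new performance sample against a rolling baseline and flag sharp deviations. This is a minimal sketch of that idea, not any particular vendor’s method; the function name, window size and sigma threshold are all illustrative assumptions:

```python
from statistics import mean, stdev

def degradation_alerts(samples, window=10, sigma=3.0):
    """Flag samples (e.g. response times in ms) that deviate sharply
    from the recent rolling baseline -- the kind of early-warning
    pattern that often precedes a full outage. Illustrative sketch;
    the window size and sigma threshold are arbitrary assumptions."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sd = mean(baseline), stdev(baseline)
        if sd > 0 and (samples[i] - mu) / sd > sigma:
            alerts.append(i)  # index of the anomalous sample
    return alerts

# Example: steady ~100 ms latency, then a sudden spike at index 12
latency_ms = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 102, 400]
print(degradation_alerts(latency_ms))  # -> [12]
```

In practice the same principle is applied per metric and per service, with the alert feeding the triage process described above rather than being inspected by hand.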

Only by unifying IT operations and monitoring under a single pane of glass can an organisation hope to get a holistic view of what is going on. A centralised view ensures there is only a single version of the truth, bringing siloed teams together, avoiding duplication of effort and, more importantly, ensuring that monitoring finally fulfils its promise to improve service performance, availability and the user experience.

Mike Walton, Founder, CEO, Opsview