IT outages are a common occurrence among the world’s largest organisations. The likes of British Airways, Google, Instagram, TSB and even IT firm Cloudflare have all been affected, leaving millions of customers without service. Much like death and taxes, outages are regularly touted as an unavoidable fact of life. Yet they are eminently preventable. Ponemon Institute research has revealed that human error is actually the second most common cause of system failure, accounting for 22 per cent of all incidents.
With this in mind, it’s clear that firms need to explore every avenue to minimise outages. With estimates putting downtime costs at $300k an hour and thousands of customers potentially ready to flee, there’s no time to wait. The direction all firms should be travelling in is towards a unified, automated IT operations monitoring solution that provides a single pane of glass across the entire IT estate.
What drives mistakes?
In a small business running around 50 devices, consistency is not a major issue and the risk of error remains fairly low, as time can be spent assessing and monitoring each individual technology. However, the likelihood of IT outages rises dramatically when the environment is scaled up and IT teams have to monitor a larger number of devices. It’s not just connected machines, either, but networks, cloud and on-premises servers and any other part of the IT infrastructure. In organisations with 10,000 to 20,000 hosts, it’s extremely difficult to ensure that effective and consistent monitoring is taking place across the entire IT ecosystem.
When teams are stretched thin, it’s easy for individuals to overlook a minor event which may cause a major issue later down the line. If they have other tasks to complete and deadlines to meet, carrying out consistent end-to-end checks across multiple systems is highly challenging, as concentration will inevitably lapse. As a result, issues are missed and IT staff find themselves in a familiar situation of firefighting and plugging gaps retrospectively. This reactive approach, which is partly the result of poor visibility into IT operations, ends up requiring a great deal more time, money and effort to support.
Take Amazon Web Services, which suffered a well-publicised outage in 2017 that lasted hours. Despite its high-tech credentials, the firm suffered major downtime all because of a single staff member who simply typed in the wrong command, taking a key server offline. One small lapse in judgement can have far-reaching consequences for an organisation.
Of course, it’s easy to point the finger at the employee who made the mistake in such situations. But in many ways, it is the business itself that is to blame. No matter how good your employees are, mistakes will always be made. Give them enough monotonous tasks in a row and consistency will drop. Without automation and clear visibility from a single, unified IT monitoring tool, small problems can quickly get out of hand.
Automating your way to success
Automation will never remove human error completely; nothing can. But it’s a great way to ensure consistency on a large scale. Of course, automation relies on thorough configuration, so if systems are set up incorrectly it can easily cause problems across a large number of devices. But when done right, and combined with unified IT monitoring, it’s capable of delivering a number of benefits.
For a start, automated IT operations and monitoring can significantly increase the likelihood of identifying small issues before they cascade into much bigger ones. When a server issue impacts related systems, the IT team will naturally attempt to fix the original server, but may miss all the other affected devices. Automated monitoring helps to rectify this kind of problem. When problems do arise, an alert is automatically flagged to a member of the IT team, who can quickly identify the root cause and take measures to resolve it. Far less is missed and the fix is much quicker.
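To make the cascade point concrete, here is a minimal sketch in Python of how a monitoring tool might work out which devices sit downstream of a failed server. The host names and the dependency map are invented for illustration, and real products are likely to model dependencies differently.

```python
from collections import deque

# Hypothetical dependency map: each host lists the hosts that rely on it.
DEPENDENTS = {
    "db-01": ["app-01", "app-02"],
    "app-01": ["web-01"],
    "app-02": ["web-01", "web-02"],
    "web-01": [],
    "web-02": [],
}


def affected_hosts(failed, dependents):
    """Return every host downstream of the failed one, nearest first."""
    seen = {failed}  # guards against loops in the dependency map
    order = []
    queue = deque([failed])
    while queue:
        host = queue.popleft()
        for child in dependents.get(host, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(affected_hosts("db-01", DEPENDENTS))
```

With the sample map, a fault on db-01 flags both application servers and both web servers, which are exactly the devices a team focused on the original server might overlook.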
Another benefit of automated IT operations is in providing a clear timeline of events. This gives technical staff a full breakdown of what has occurred and when, flagging any anomalies that might be buried within the system. Even if initial problems were missed, issues can easily be traced back to their source, which is not always possible when humans are overseeing thousands of jobs per day. Data doesn’t lie.
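As a rough illustration of the timeline idea, the Python sketch below merges events from several hosts into a single chronological view. The event data and the build_timeline helper are hypothetical, and the earliest event is only a candidate for the root cause, not proof of it.

```python
from datetime import datetime

# Hypothetical events collected from different hosts, in arrival order.
EVENTS = [
    ("2024-01-01T10:05:00", "web-01", "HTTP 500 rate above threshold"),
    ("2024-01-01T10:01:30", "db-01", "disk usage at 98 per cent"),
    ("2024-01-01T10:03:10", "app-01", "database connection timeouts"),
]


def build_timeline(events):
    """Sort events by timestamp so the earliest anomaly comes first."""
    return sorted(events, key=lambda event: datetime.fromisoformat(event[0]))


timeline = build_timeline(EVENTS)
for stamp, host, message in timeline:
    print(stamp, host, message)
```

Sorted this way, the disk warning on db-01 appears ahead of the application timeouts and the web errors, giving staff a sensible place to start tracing the fault.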
In theory, this also helps to speed up fixes and reduce costly downtime. And with less time spent on firefighting, IT professionals are freed up to work on higher-value tasks. They may even get more enjoyment out of their jobs, increasing productivity. Automation could also have an impact on the structure of the IT team in the long term. Fewer staff may be required as there are fewer manual tasks to complete. Gartner suggests that using automation could cost as little as 20 per cent of an employee’s salary, freeing up capital for more highly skilled workers. Where once there were many employees manually plugging gaps and fixing issues, in the future there could be fewer, highly trained employees capable of helping with more strategic projects.
A single pane of glass
The truth is that the expectations of modern consumers are extremely high and rising all the time. This means downtime can result in significant customer attrition, brand damage and a serious impact on the bottom line. At the same time, complex digital transformation initiatives are placing increasing demand on already stretched IT teams. That makes automated IT operations a must-have, providing insight where you need it most: across the entire IT stack and out to systems managed by third-party providers.
Armed with a centralised tool, organisations can finally overcome persistent challenges like tool sprawl, which perpetuates IT silos and inundates teams with data, slowing response times. With a single-pane-of-glass solution that works across the technology environment, IT managers finally have the visibility they need to spot problems early on and minimise downtime, driving strategic value for the entire organisation.
Neil Ferguson, VP, Sales Engineering, Opsview