
How to transition IT operations from fighting fires to outage prevention

In organisations large and small, there's often a disconnect between the top office and the troops in the field – a disconnect that can cost a lot of time, effort, and money. When IT leadership doesn't fully understand why systems are going down, they cannot ensure such IT outages won’t recur. Thus, it's incumbent on management to institute a system that will provide them with the information they need in order to transition from constant crisis management to proactive prevention.   

That, at least, is how it is supposed to work – but executives who try to access that information often end up getting lost among the trees in a mighty forest. The truth is that even IT executives and experts have a hard time grokking what is going on. Modern networks and data systems are just too complicated, with too many details and potential points of failure.

To wit: a recent study by the University of Chicago ranked the 13 leading causes of service outages at online service companies, but the data applies to IT departments too. The study examined data on 516 unplanned outages, in the hope of figuring out what happened and why.

Who to lay the blame on?

The study found, for example, that software and system upgrades were a factor in about 15 per cent of the outages. Changes that looked good to go on test servers had a “bad reaction” when they met up with the full ecosystem, the researchers discovered – requiring a rollback of the changes while the IT department figured out how to perform the upgrade without knocking out service.   

Misconfiguration accounted for another 10 per cent of the outages, with incorrect data written to configuration files. While some of those errors were made by IT workers, many were not; newly installed or upgraded software often rewrites the configuration files, with the details buried in a manual and unlikely to be recognised – or discovered – by IT workers without a great deal of effort.

But the most common category of outage – and the one that should raise the most concern among executives – is the one with no identified cause: nearly half of the outages were attributed to “unknown” factors. In 294 of the 516 outages, the team was unable to determine what the problem was.

If a top team of university researchers couldn't figure it out, it's unlikely that a harried and probably understaffed IT department would be able to either – and if they can't, what hope have the executives, or indeed the organisations they lead? 

Automated systems are needed

The only way forward is to get some outside help – an automated system that constantly analyses what can go wrong with a network, and immediately alerts IT executives when something happens that could cause an outage. Fortunately, such solutions do exist. The top-tier ones don’t merely analyse log and configuration files; they aggregate information to provide a complete view of all layers of the infrastructure, checking that systems are aligned with the best practices recommended by their vendors and work properly with each other.

When changes are made, the system knows, and alerts the relevant parties. It also draws upon data generated by outages at other organisations, comparing the current situation with those outages to look for likely causes and solutions. All the data is presented in a clear, concise manner – allowing IT teams and executives to focus immediately on the problem and solve it.

Implementing a system like that will not only enable both the IT team and the executive suite to sleep more soundly – it will encourage and enhance best practices within an organisation.
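The change-detection behaviour described above can be illustrated with a minimal Python sketch. It is not how any particular product works; it simply assumes that "knowing when changes are made" can be approximated by hashing a set of configuration files and comparing the result with an earlier baseline. The function names `snapshot` and `detect_changes` are hypothetical, chosen for this illustration.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each file path to the SHA-256 digest of its current contents."""
    return {
        str(path): hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for path in paths
    }


def detect_changes(baseline, current):
    """Return the sorted paths that were added, removed, or modified.

    Both arguments are dicts as produced by snapshot(). A path is
    reported when its digest differs between the two snapshots or when
    it appears in only one of them.
    """
    all_paths = set(baseline) | set(current)
    return sorted(p for p in all_paths if baseline.get(p) != current.get(p))
```

In practice, a baseline snapshot would be taken while the system is known to be healthy, and a fresh snapshot compared against it after every upgrade or installation, so that a configuration file silently rewritten by new software shows up as a reported change rather than as an unexplained outage.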

As networks and systems get more complicated and new equipment, services, databases, etc. are added, there is a need to ensure that integration will be quick and efficient – and that their configurations, besides being aligned with vendor recommendations, are compatible with the current environment. In addition, the sense of control that executives derive from the data will allow them to make smarter decisions about allocating resources and attention to the areas that pose the greatest risk to IT service availability and reliability. And executives will be able to keep their fingers on the pulse and proactively address any newly detected vulnerabilities before they turn into an outage or service disruption.
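As a rough illustration of checking a configuration against vendor recommendations, the sketch below compares a dictionary of actual settings with a dictionary of recommended ones and reports every deviation. The function name `find_deviations` and the setting names in the usage note are hypothetical; a real tool would need to parse each product's own configuration format and would cover far more than simple equality checks.

```python
def find_deviations(actual, recommended):
    """Return {setting: (actual_value, recommended_value)} for each mismatch.

    A setting that is recommended but absent from the actual
    configuration is reported with an actual value of None. Settings
    that are present but have no recommendation are ignored.
    """
    return {
        key: (actual.get(key), wanted)
        for key, wanted in recommended.items()
        if actual.get(key) != wanted
    }
```

For example, with made-up settings, `find_deviations({"max_connections": 100, "tls": "on"}, {"max_connections": 200, "tls": "on", "log_level": "warn"})` returns `{"max_connections": (100, 200), "log_level": (None, "warn")}`, giving the IT team a short list of items to review instead of a manual to read end to end.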

IT systems today are far more complicated than ever before. Without immediate visibility into the current state of the IT infrastructure, based on automated analysis across all infrastructure layers, there's little hope companies will be able to avoid significant problems somewhere along the line.

Doron Pinhas, CTO at Continuity Software