Terror? Hackers? The real reasons for IT outages are far more boring

In our hyper-connected and hyper-sensitive world, the immediate reaction among many people when they hear of a major service or IT outage – and especially when numerous incidents occur simultaneously – is “terror,” or at least “hackers,” who may or may not be associated with terror groups.

Such rumors were flying on 8 July 2015, when three major outages occurred. On that day, the New York Stock Exchange halted trading, United Airlines had to ground hundreds of flights due to a service problem, and the Wall Street Journal website went down.

Was it terror? Some kind of conspiracy against Wall Street and their cushy corporate business trips? Script kiddies? Or Anonymous, which, in a tweet from an account allegedly associated with the group, had wondered whether 8 July “will be a bad day for Wall Street?”

As usual, the truth is far less dramatic, if no less damaging, because losses are losses regardless of how they are caused. United was up and running again after about two hours, during which some 800 flights were delayed and about 60 canceled; the company attributed the problem to “reduced network connectivity” due to a replaced router.

The NYSE outage suspended trading at exactly the same time as the United outage, further fueling conspiracy theories. The suspension lasted about four hours, with all trading halted and pending trades canceled. In a statement, the NYSE said that the outage was due to “the rollout of a software release” that was loaded onto computers “not loaded with the proper configuration compatible with the new release.”

The WSJ never said what their issue was, but after being out of service for several hours, the site slowly returned to life – apparently the victim of overload by worried traders and investors who were looking for updates on why the NYSE was down.

The truth is that those were far from the only outages and glitches that day. On any given day there may be dozens of partial or full service outages, affecting anywhere from a few hundred customers to millions. It's usually the latter that make the news, but if you are among the hundreds affected, the service outage in question likely looms very large in your life.

As it turns out, nearly all such outages stem from internal failures of IT infrastructure and processes, so companies really have no one to blame but themselves. The distinction matters: a company hit by an Anonymous DDoS attack could fairly claim that it did everything right and was simply an innocent victim (not that there aren't ways to stop those), but there's no shirking responsibility when the failure is one of not taking all necessary steps to ensure resiliency. Shouldn’t the NYSE, for example, have checked to see that “customer gateways were loaded with the proper configuration compatible with the new release?”

The answer is probably yes, but for many companies that kind of checking is beyond their capabilities. There are dozens, if not hundreds, of factors that can be affected by a software upgrade or by any of the many configuration changes that take place daily in a complex IT environment. To expect even the best IT staff to decipher the structure of an IT landscape as complex as the NYSE's (or UAL's, the WSJ's, or those of less famous names) and to foresee the implications of every configuration change is to set a very high bar, one that most human beings are very unlikely to meet.

It's here that some help from advanced analytics could come in handy. A predictive analytics process could determine in advance the likely impact of a system change, enabling IT staff to prepare for and alleviate problems before they happen. By analyzing the dependencies and interactions between all layers of the infrastructure and proactively identifying recent changes that could introduce risks to stability, IT teams can gain the visibility they need to ensure that all systems and services will continue to work as intended.
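To make the idea of dependency analysis concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any vendor's product or of the systems at the NYSE or United. The component names, the `DEPENDS_ON` map, and the functions `impacted_by` and `risk_report` are all hypothetical. The sketch assumes the dependencies between components are already known, then walks them to list everything that could be affected by a recent change.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it depends on.
# These names are invented for illustration, not taken from any real system.
DEPENDS_ON = {
    "trading-app": ["customer-gateway", "order-db"],
    "customer-gateway": ["core-router"],
    "order-db": ["storage-array"],
    "reporting": ["order-db"],
}


def impacted_by(changed, depends_on):
    """Return every component that directly or transitively depends on `changed`."""
    # Invert the map so we can go from a component to the things that rely on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk up the chain of dependents.
    impacted = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


def risk_report(recent_changes, depends_on):
    """Map each recently changed component to the components it could affect."""
    return {c: sorted(impacted_by(c, depends_on)) for c in recent_changes}


if __name__ == "__main__":
    for change, affected in risk_report(["core-router"], DEPENDS_ON).items():
        print(f"Change to {change} may affect: {', '.join(affected)}")
```

In this toy map, a change to the router reaches the gateway and, through it, the trading application. A real environment would have many more layers, and the hard part is discovering the dependency map in the first place, which is where automated analysis earns its keep.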

Could such a process have prevented the outages at the NYSE and UAL? Hindsight is 20/20, of course, so it's easy to say yes; but if the IT folks at those institutions – and at others that have managed to stay out of the headlines for now – are as smart as they are supposed to be, they will likely be looking at IT analytics to help protect their customers as well as their bottom line.

That would also save us yet another round of worrying about how “the hackers” are winning. We all know we have enough to worry about as it is.

Yaniv Valik, VP Product Management & Customer Success, Continuity Software
