2018 goal: Eliminating ‘black holes’ in the IT Stack

In this appropriately-termed Information Age, there is plenty of information – yet, at least according to some, the data overload we face just makes us more confused. The same goes for the enterprise: the collective haystack of data in organisations – call it the data stack – grows and grows, with new applications, connections, and services recording their every action, making the search for needles more difficult than ever.

But it's not just the needles we have to search for, metaphorically speaking – it's specific needles that perhaps look more like toothpicks. How do you differentiate? The information we need to analyse why things happen in an organisation is readily available, but it's not necessarily easily attainable, discoverable or actionable. Log files stuffed with data may “have the answers,” but how can even the most experienced IT worker be expected to mine through all of it and come up with a way to solve the problem?

The answer to that question is crucial. Outages aren't just an inconvenience to customers or employees – they're all about the money. According to a Rand Group survey, 98 per cent of organisations say a single hour of downtime costs over $100,000 – and more than a third say they lose as much as $5 million in one hour of downtime! That, of course, is in addition to the indirect costs to an organisation: loss of confidence among customers who can't access the services they need, a pall on the organisation's reputation for reliability, the inevitable criticism from shareholders, investors and board members, and so on.

So, getting definitive answers, and quickly, is key. But getting those answers is harder than it appears, despite the huge data resources available to organisations. A typical enterprise IT system could have thousands of log files recording the activity of the many services and applications it runs. And because the providers of those services and applications want users to feel they are getting their money's worth, they throw in all sorts of bells and whistles, including in-depth and often multiple log files that can be examined and analysed when a problem occurs, used to generate usage reports, and so on. IT personnel expect access to that kind of data, and vendors, eager to make their products more attractive to IT buyers, duly record everything their application or service does.

'Unknown events'

Obviously, no one is going to peruse these log files manually, so IT personnel write scripts or use commercial packages to look for the data they think they need. But what do they need? Do they really know what to look for? Is searching for a term enough? According to a University of Chicago study, the answer is no: the most common reported cause of outages in organisations running cloud services is simply “unknown.” In fact, “unknown events” accounted for the largest share of the 597 unplanned cloud outages that occurred over the seven years from 2009 to 2015 in the large enterprises studied – more than upgrades, bugs, configuration issues, or even human error. What's especially notable is that the study takes into account not just initial reports but post-incident analyses, in which it can be presumed IT teams employed whatever forensic tools were at their disposal to determine the cause of the outage.
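To see why searching for a term falls short, consider the kind of keyword scan many teams script for themselves. This is a minimal sketch; the directory, file pattern and search terms are hypothetical, not taken from the study or any particular product.

from pathlib import Path

SEARCH_TERMS = ("ERROR", "FATAL", "timeout")  # assumed terms of interest

def scan_logs(log_dir):
    """Yield (file name, line number, line) for lines matching any known term."""
    for log_file in Path(log_dir).glob("**/*.log"):
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(term in line for term in SEARCH_TERMS):
                    yield log_file.name, number, line.rstrip()

if __name__ == "__main__":
    # "/var/log" is a placeholder; point this at whatever directory holds the logs
    for name, number, line in scan_logs("/var/log"):
        print(f"{name}:{number}: {line}")

A script like this only ever finds problems someone already knew how to name; anything genuinely unexpected sails straight past it.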

Even after such analysis, teams couldn't definitively figure out exactly what had gone wrong, even months later. The study notes that outage reports were “vague” about the cause in more than half of the cases studied (355 out of 597) – particularly noteworthy considering just how much data there is documenting each problem. Don't be too hard on those IT teams, though; they're just people, like you and me, and there is only so much data the human mind can absorb and process before it goes into “data overload” mode and begins blocking out information.

Add it all up, and you realise that what is missing is not the data, and not the way to search for it, but what to do with it once you find it. Clearly this is not a job for man but for machine – preferably one armed with artificial intelligence, able to apply machine learning and related techniques to analyse those files. With thousands, if not millions, of log files, config files, and other sources of information on large networks, “outside assistance” of this type is the only way to get definitive answers quickly enough to prevent a repeat of the problem.
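To make the idea concrete, here is one simple way a machine can surface “unknown” events: reduce each log line to a rough template and flag the templates that almost never appear. This is a sketch under assumptions of my own (the normalisation rules and the rarity threshold), not any vendor's actual method.

import re
from collections import Counter

def to_template(line):
    """Replace numbers, hex strings and ID-like tokens with placeholders."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b[0-9a-fA-F-]{32,36}\b", "<ID>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line.strip()

def rare_events(lines, max_count=2):
    """Return log templates seen at most max_count times, rarest first."""
    counts = Counter(to_template(line) for line in lines if line.strip())
    return sorted((count, template) for template, count in counts.items() if count <= max_count)

if __name__ == "__main__":
    sample = [
        "2018-01-02 10:00:01 INFO request 4411 served in 12ms",
        "2018-01-02 10:00:02 INFO request 4412 served in 9ms",
        "2018-01-02 10:00:03 WARN pool exhausted, queueing request 4413",
    ]
    for count, template in rare_events(sample):
        print(count, template)

A rare template is not a root cause in itself, but it is exactly the kind of needle a fixed keyword list never surfaces.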

But even with AI systems in place, IT staff need to know how to tap into their capabilities. Requests to the system have to be structured so that the right information is searched for and returned, and because those data structures are coded by programmers for specific purposes, building them requires real development work. They need to be designed to draw out the actionable information the organisation needs to ensure the outage is not repeated – for example, to answer “did a new software installation cause this outage?” Installations often rewrite existing files and configurations, breaking dependencies and impacting customised systems. And although IT teams usually have very specific installation or upgrade procedures in place, it's not uncommon for a new team member who isn't aware of those procedures to make a wrong move. Once that's done, it's almost impossible to unravel the chain of events without a deep, AI-style analysis.
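As a hypothetical illustration of that installation question expressed as a structured query, the sketch below cross-references install and upgrade events, parsed from an assumed dpkg-style log, against the hours leading up to an outage. The log entries, format and timestamps are invented for the example.

from datetime import datetime, timedelta

INSTALL_LOG = [  # stand-in for lines read from an install log such as /var/log/dpkg.log
    "2018-01-14 02:10:33 install libssl1.0.2:amd64 1.0.2g 1.0.2l",
    "2018-01-14 02:11:05 upgrade billing-service:all 3.4.1 3.5.0",
]

def installs_before_outage(lines, outage_start, window_hours=24):
    """Return (timestamp, action, package) events in the window before the outage."""
    suspects = []
    for line in lines:
        date, time, action, package, *_ = line.split()
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        if outage_start - timedelta(hours=window_hours) <= stamp <= outage_start:
            suspects.append((stamp, action, package))
    return suspects

if __name__ == "__main__":
    outage = datetime(2018, 1, 14, 9, 30)  # assumed outage start time
    for stamp, action, package in installs_before_outage(INSTALL_LOG, outage):
        print(f"{stamp} {action} {package}")

In a real deployment the events would come from the analysis platform's own data structures rather than a hand-parsed log, which is precisely the point of the next paragraph.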

Considering the effort that goes into developing these structures, the best move for an organisation is to use an AI analysis package that comes with pre-built data structures; the objective of AI analysis, after all, is to get back online as quickly as possible and to figure out what the problem was so that it does not happen again. The lesson for companies determined to prevent outages is this: your IT team may be top-tier, but some tasks are beyond it – and beyond the abilities of any human being. Eliminating the “black holes” in the data stack is, by necessity, going to be a joint human-AI effort. It remains for organisations to pick the right AI partner.

Gabby Menachem, CEO and Co-founder of Loom Systems