IT Operations professionals have always dreamed of being able to predict incidents, i.e., service interruptions and outages that negatively impact the business. Those dreams began to convert to serious hopes and even demands around 2010 with the emergence of commercially viable big data and analytics solutions.
The idea here was that, with sufficient quantities of data at hand, and sufficiently powerful statistical analysis tools, deep correlational patterns would emerge from the data. Nirvana was that future events could be predicted with a reasonably high degree of accuracy. Put another way: IT Operations professionals - like many business professionals - were convinced that if one had enough information about the past, the future could be revealed.
Over the next five years or so, enterprises slowly concluded that data and traditional statistics alone would not turn the IT department into Nostradamus. This was the result of unsatisfactory experiences with vendors like Netuitive, Integrien, and ProactiveNet. As data volumes grew, so did the amount of corrupt and duplicate data as a percentage of the total. This undercut much of the value of having easier access to greater quantities of data.
At the same time, statistical methodologies were, if applicable at all, much better at delivering correlated information about the past, rather than causal actionable information about the future. Indeed, growing frustration with the inability of the big data/analytics approach to yield insights into the future was one of the reasons why the market turned to artificial intelligence (AI).
The rationale here was that the failure of the big data/analytics approach was due to the inability of the unaided human mind to make sense of such large data sets, even with statistical analysis software. What was needed was some kind of automated mental prosthetic that would boost the ability of eyes to see and brains to think. With that boost, the patterns governing large data sets would emerge and, with those patterns in hand, so would the ability to predict the future.
The emergence of predictive analytics
Now, of course, the application of AIOps (AI applied to IT Operations use cases) is about a lot more than predicting the occurrence of incidents in the future. It has even been argued by some (even me back when I was a Gartner analyst) that using AI-based algorithms to attempt to predict the future is a fool’s game and bound to end in failure.
Nonetheless, under the term ‘predictive analytics’, many vendors today are positioning AI-driven prediction as core to the AIOps value proposition. In fact, I now think they are right, and I was wrong. Prediction, suitably understood and circumscribed, is a legitimate and valuable function of AIOps technology. However, it is also true that the way in which most vendors and many users understand what prediction is and how it should be delivered is fundamentally wrong and will lead to disappointment.
In the paragraphs that follow, I will first outline the common understanding of AI-driven predictive analytics and why it is likely to fail in its goals. Then I will present an alternative way of looking at how one can achieve genuine insight into future IT system behaviours and misbehaviours.
The classical approach
AI-driven predictive analytics is usually seen as an automation of the practice of classical statistics. Keep in mind that classical statistical methodology has itself been enhanced by the availability of cheap computing power. Classical statistics comes in two flavours: Frequentist and Bayesian.
Frequentist. The frequentist approach starts with a collection of almost identical equations, each capable of describing a data set. The equations differ from one another only in the values of a small set of parameters. Presented with a data set, the classical frequentist statistician decides which values those parameters should take, and plugs them in. The resulting equation is then tested against some further data sets, and the parameter values adjusted if necessary.
Of course, there are many techniques for selecting parameter values and for conducting the tests, but most of these techniques can be automated. That is, if you start with a collection of equations that differ only in their parameters, and a collection of data sets to help you choose the right values, a deterministic algorithm will deliver the desired results.
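That automation can be sketched in a few lines. In this illustration (the one-parameter model family y = a·x, the data, and the candidate grid are all invented for the example, not drawn from any real product), a deterministic search selects the parameter value that best describes the training data and then checks it against a holdout set:

```python
# A minimal sketch of automated frequentist parameter selection.
# The model family, data, and candidate grid are illustrative assumptions.

def sum_sq_error(a, data):
    """Squared error of the one-parameter model y = a * x on a data set."""
    return sum((y - a * x) ** 2 for x, y in data)

def fit_parameter(train, candidates):
    """Pick the parameter value that best describes the training data."""
    return min(candidates, key=lambda a: sum_sq_error(a, train))

# Data generated by y = 2x (the 'world' behind the data sets).
train = [(x, 2 * x) for x in range(1, 6)]
holdout = [(x, 2 * x) for x in range(6, 11)]

candidates = [a / 10 for a in range(0, 41)]  # 0.0, 0.1, ..., 4.0
a_hat = fit_parameter(train, candidates)

# Test the fitted parameter on further data; adjust if the error is large.
print(a_hat)                          # 2.0
print(sum_sq_error(a_hat, holdout))   # 0.0
```

Every step here is mechanical, which is exactly why the methodology lends itself to automation.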
What does this have to do with prediction?
Well, once the parameters have been selected, one is in possession of an equation which purports to describe the real world, the world ultimately responsible for generating the data sets. The fixed equation typically describes a function, which can itself be thought of as a conceptual machine that, when supplied with inputs, deterministically returns outputs.
Armed with that equation, an IT Operations professional who observes the inputs can 'predict' the outputs. Remember, the equation has been brought into existence by automating parameter selection. The Nostradamus factor comes into play when the equation returns outputs on the basis of inputs that include time stamps. In such a case, the IT Operations professional may observe a CPU usage input and a timestamp of 2:00AM and, on the basis of the equation, predict that end users will experience a degradation of response time in 15 minutes.
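That scenario can be illustrated with a toy fitted equation. The coefficients, the 2:00AM batch-job term, and the 15-minute horizon below are all invented for illustration, standing in for whatever equation a fitting step would actually produce:

```python
# Illustrative only: coefficients and the 15-minute horizon are assumptions,
# not the output of any real fitting process.

def predicted_response_ms(cpu_pct: float, hour: int) -> float:
    """A 'fitted equation' mapping observed inputs (CPU usage, timestamp)
    to a predicted output: expected response time (ms) 15 minutes from now."""
    nightly_batch = 50.0 if hour == 2 else 0.0  # hypothetical 2:00AM term
    return 100.0 + 3.0 * cpu_pct + nightly_batch

# Observing CPU at 80% with a 2:00AM timestamp 'predicts' degraded
# response time 15 minutes out; the same load at 2:00PM predicts less.
print(predicted_response_ms(80.0, hour=2))    # 390.0
print(predicted_response_ms(80.0, hour=14))   # 340.0
```

The equation says nothing about why the degradation occurs, a point that becomes important below.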
Bayesian. The other approach to classical statistical methodology is Bayesian. It starts with a parameter already selected: there already exists an equation put forward as potentially able to describe any data sets to be presented. Of course, in the absence of any actual data, the Bayesian statistician can only state a probability with which a given parametrically determined equation accurately describes the world. As the statistician actually observes data, the parameter selections are likely to require updating, and the probability with which the new equation holds of any future observed data sets will likewise be modified. The update rules are themselves complex but can, in many cases, ultimately be automated.
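A minimal sketch of that update loop, assuming a discrete set of candidate parameter values and a deliberately crude likelihood function (both invented for illustration):

```python
# Hypothetical Bayesian updating over candidate parameters for y = a * x.
# The candidate set, tolerance, and observations are all assumptions.

def likelihood(a, x, y, tol=0.5):
    """Crude likelihood: how plausible observation (x, y) is under y = a*x."""
    return 1.0 if abs(y - a * x) <= tol else 0.01

def update(prior, observation):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    x, y = observation
    unnorm = {a: likelihood(a, x, y) * p for a, p in prior.items()}
    total = sum(unnorm.values())
    return {a: p / total for a, p in unnorm.items()}

# Start with a uniform prior over three candidate parameter values.
belief = {1.0: 1 / 3, 2.0: 1 / 3, 3.0: 1 / 3}

# Each observed data point shifts probability toward a = 2.
for obs in [(1, 2.1), (2, 3.9), (3, 6.0)]:
    belief = update(belief, obs)

print(max(belief, key=belief.get))  # 2.0
```

Each pass through the loop is an application of the same mechanical rule, which is why this flavour of the methodology is also automatable.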
The problem with predictive analytics
There has been a lot of ink spilled on philosophical arguments between Frequentists and Bayesians. Indeed, despite sharing the core mathematics of probability, the two approaches can lead to very different conclusions. Furthermore, the machine learning and AI community has largely sided with the Bayesians, although not dogmatically.
Nonetheless, in both cases the end point is an equation, whether a parameter value is selected outright or a previous selection gradually evolves until no further adjustment is required. Prediction is then effected by observing inputs and reading off the equation's outputs. AI enters via the automation of the process of applying (frequentist or Bayesian) statistical methodology.
Three issues undermine this approach to predictive analytics
First, this approach relies on relatively clean data sets to get off the ground. Both Frequentist and Bayesian versions tolerate some noise, and the Bayesian approach is somewhat more robust in its presence, but neither can function well at all once noise levels reach the 10 to 20 per cent range. Unfortunately, in modern environments, the noise levels associated with the data sets an IT Operations professional needs to work with are in the 90 per cent plus range.
Second, even if the data sets delivered to the automated statistician were magically pure, the derived equation is itself based upon historical data. Ironically, its ability to help an IT Operations professional predict the future depends heavily upon IT system behaviour remaining much as it always has been.
Third, and perhaps most problematically, the equations generated indicate, at best, correlations among the data items observed. This is true even if the data were pure, and even if IT systems and user interactions with those systems remain largely unchanged. Hence, in so far as the generated equations describe the world behind the data, they only describe how events in that world correlate with one another.
Now, while one may just want to anticipate what is going to happen, IT Operations teams usually want to be able to do something to prevent events that impact user experience or business processes from happening. Just knowing that something will happen, even with high probability, is not sufficient.
The equations must be supplemented with further information that allows an observer to figure out which events are happening or will happen before the impacting event. IT Ops practitioners must address these triggers in order to prevent the impacting incident from occurring. Put another way, predictive analytics has been at best correlational when it needs to be causal instead. As an IT Ops professional, you don’t just want to know that response times will degrade. You want to be able to take steps to prevent that degradation from taking place.
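The gap between correlation and cause can be made concrete with a toy world (everything here is hypothetical: the batch job, the variables, and the intervention are invented to illustrate the point). CPU load and response time move together because a common cause drives both, so acting on the correlated symptom changes nothing:

```python
# Hypothetical toy world: a batch job is the common cause of both high CPU
# and slow responses. CPU correlates with response time, but intervening on
# CPU alone does not prevent the degradation; addressing the trigger does.

def system(batch_running: bool, cpu_capped: bool = False):
    """Returns (cpu_pct, response_ms) for the toy world."""
    cpu_pct = 90.0 if batch_running and not cpu_capped else 30.0
    response_ms = 500.0 if batch_running else 100.0  # driven by the batch job
    return cpu_pct, response_ms

# Observationally, high CPU and slow responses occur together...
print(system(batch_running=True))                   # (90.0, 500.0)
# ...but 'fixing' the correlated symptom changes nothing:
print(system(batch_running=True, cpu_capped=True))  # (30.0, 500.0)
# Addressing the causal trigger prevents the degradation:
print(system(batch_running=False))                  # (30.0, 100.0)
```

A purely correlational equation fitted to this world would flag CPU as the predictor, yet capping CPU leaves users just as badly off; only knowledge of the causal trigger supports effective action.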
Will Cappelli, CTO EMEA and Global VP of Product Strategy, Moogsoft