
Putting automation at the heart of the future of AIOps


Most AIOps technology is focused on better understanding what is taking place in the IT environment and drawing attention to current or future problems. Some products go further and recommend explicit remedies for such problems, and some go further still, providing links to process automation platforms that execute remediations based upon recipes.

Many IT professionals worry about whether humans can intervene during the automated formulation of recipes and the execution of remedial tasks. Most of the concern arises from a lack of trust in, or comfort with, the often esoteric observations and analyses originally generated by the AIOps platform.

The inevitable closing of the loop

There are two major problems with this general state of affairs. First, the key reason why AIOps technology has been deployed for observation and analysis is that the phenomena - the manifestations of IT system behavior - have become too complex for humans to grasp. On their own, human beings cannot determine that something significant is taking place in today’s IT environments until massive damage has already occurred - let alone determine what has taken place or why. So, if the capacity of human IT operations professionals has been maxed out, why should a business trust human judgement more than the judgement of the AIOps technology?

Second, the possibility of human intervention is meaningful only if events occur in time scales that are ‘just the right size’ for human intervention. If the time scales are too large - say years or even months - the effectiveness of human intervention is limited. If the time scales are too small, say seconds, human intervention just can’t take place. Given these issues, we can hazard a guess that over the next five years or so, as IT systems become more complex and business process tasks execute in ever smaller time scales, concerns about the need for possible human intervention will fall by the wayside. Questions of trust in AIOps will become moot, simply because enterprises will have no alternative but to trust it.

So ‘closing the loop’ is an inevitability. And, indeed, many AIOps vendors today are trying to beef up the automation credentials of their platforms, with encouraging results for end users. For example, a leading financial services company has an “Automation Through Moogsoft” policy: if any automation is to be performed, it must be orchestrated via Moogsoft. The result? In one month, the company saved over 10,000 man-hours through automation. Another case in point is a global wealth management and investment bank that has automated the handling of 60 percent of all incidents; outside of its country of origin, the figure is 84 percent.

But there is another point which needs to be kept in mind. Closing the loop should not be looked at as a completion or extension of AIOps. Instead it should be seen as a fundamental transformation of the field and its associated technology.

How will AIOps change?

So how will AIOps with automation at its heart be different? One can describe current AIOps platforms in terms of a basic workflow. Data streams in from an IT environment. Significant elements of data are selected from that stream. Algorithms seek to correlate those significant elements and then determine which of those correlations indicate a relationship of causality.

Once all of this analysis is done, the result is passed on to an environment where humans and, perhaps, bots congregate and collaborate to resolve what has been revealed, using the causal information derived during the last stage of the analysis. The action required to effect the resolution may be carried out either by human beings or by an automation platform of one sort or another. Although undoubtedly valuable because it significantly reduces mean time to detect (MTTD) and mean time to repair (MTTR), this kind of AIOps is a rather passive, reactive affair.
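
Purely as an illustration of that workflow, the pipeline might be sketched in Python. Every name below is a hypothetical placeholder for the proprietary stages a real AIOps platform would implement:

    # A minimal sketch of the basic AIOps workflow described above.
    # Every function here is an illustrative stand-in, not a real platform API.

    def is_significant(event):
        # Stand-in for the selection stage (anomaly or entropy scoring).
        return event.get("severity", 0) >= 3

    def correlate(events):
        # Stand-in for the correlation stage (here: naive grouping by service).
        groups = {}
        for e in events:
            groups.setdefault(e["service"], []).append(e)
        return groups

    def infer_causality(groups):
        # Stand-in for causal analysis (here: earliest event per group).
        return {svc: min(evts, key=lambda e: e["time"]) for svc, evts in groups.items()}

    def aiops_workflow(event_stream):
        significant = [e for e in event_stream if is_significant(e)]
        for svc, root in infer_causality(correlate(significant)).items():
            # In a real platform, this hands off to a collaboration environment.
            print(f"{svc}: probable root cause -> {root['message']}")

    aiops_workflow([
        {"service": "db", "severity": 5, "time": 1, "message": "disk latency spike"},
        {"service": "db", "severity": 4, "time": 2, "message": "query timeouts"},
        {"service": "web", "severity": 1, "time": 3, "message": "routine deploy"},
    ])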

Observation is prediction

In the future, however, AIOps will take a far more active role. Instead of simply receiving data from the IT environment and then working on it, AIOps platforms will approach the incoming data stream with a hypothesis about what it should contain. Like a scientist performing an experiment, the AIOps platform will have formulated a scenario which makes predictions about IT environment behaviors. The model will then be tested through a comparison between the actual incoming data sets and what the model expects.

Now suppose the model makes a poor prediction and the data set that arrives is not what was expected. Let’s assume, for example, the AIOps platform predicts that a given storage medium will be, on average, 75 percent full. In reality, however, the average turns out to be around 90 percent. At that point, the AIOps platform will have to make a critical decision.
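
A toy version of that comparison, reusing the storage figures above (the tolerance value is invented for illustration), might look like this:

    # Toy model-versus-data check for the storage utilization example.
    # The tolerance is an arbitrary illustration, not a recommended setting.

    PREDICTED_UTILIZATION = 75.0  # the model's expectation, in percent
    TOLERANCE = 10.0              # allowed deviation before a conflict is flagged

    def check_prediction(observed_utilization):
        deviation = abs(observed_utilization - PREDICTED_UTILIZATION)
        if deviation <= TOLERANCE:
            return "model confirmed"
        # The deviation is significant: the platform must now decide whether
        # the model, the data, or the environment itself is at fault.
        return "conflict: revise model, inspect data, or remediate environment"

    print(check_prediction(90.0))  # -> conflict: ...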

Is the model incorrect and, hence, in need of revision? Or is there something wrong with the data itself - is the data generation process at fault? Note that whenever data is dismissed as noise, the observer has made precisely that decision: the model is not at fault; the data is. The proper reaction at that point, then, is to modify something in the data generation process - typically under the guidance of the model being tested - in the hope that better data is generated.

However, the conflict between model and data can mean something far more profound. It could indicate a conflict between a desired state of the IT system (say, its readiness to support a new business application) and reality. Facing this kind of conflict, the AIOps system should enact a more far-reaching modification of the environment to ensure that the environment is indeed in its desired state.
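
That kind of corrective action resembles a desired-state reconciliation loop. A minimal sketch, with entirely hypothetical state properties, could run as follows:

    # Hypothetical desired-state reconciliation in the spirit described above.
    # The state properties are invented examples.

    desired = {"replicas": 4, "tls_enabled": True}
    actual = {"replicas": 2, "tls_enabled": True}

    def reconcile(desired_state, actual_state):
        # Compare each property and emit the remediation needed to converge.
        for key, want in desired_state.items():
            have = actual_state.get(key)
            if have != want:
                print(f"remediate {key}: {have} -> {want}")

    reconcile(desired, actual)  # -> remediate replicas: 2 -> 4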

Noise and strategic failure

So how does an AIOps platform decide whether the model or the environment requires correction? And if the latter, how far should that correction go? To answer these questions, let us slightly modify our picture.

Rather than thinking in terms of a single model being tested against the data, think in terms of a hierarchy of models. At the lowest level there is the original incoming stream of data. At the second level, there is a model that is very specific and confined to very particular times and locations. At the third level, there is another, more abstract and more broadly applicable model that treats the model on the second level, if verified by the data on the first level, as the data that it needs to be concerned with. This continues up through an indeterminate number of levels.

For example, if you have an ERP application which supports an end-to-end order management process, a lower-level model would predict the average response time that an individual user of that application experiences while performing the task assigned to them in the overall process. A higher-level model would predict the end-to-end latency of the entire process, treating the verified predictions of the lower-level model as its data.

At the very top, however, there is what is meant to be a universally applicable model. No matter the time, the place, or the circumstances, that model is meant to hold. Now, if we are thinking in terms of science, we might think of such an uppermost model as something like the fundamental laws of physics or, if we are of a mathematical bent, the axioms of set theory. When talking about a designed system, however - an IT environment intended to achieve a set of business goals - that uppermost model is nothing more and nothing less than the IT business strategy itself.

For example, the model may consist of statements like ‘all IT systems must perform in such a way that customers complete 90 percent of the transactions they initiate on enterprise websites.’ In other words, should that model confront data that conflicts with it, it is always the data that is at fault.
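
Putting these pieces together, the hierarchy might be represented, purely illustratively, as an ordered list in which each model treats the verified output of the level below as its data:

    # Illustrative model hierarchy; the levels and predictions are invented.
    # Level 0 is raw data; each higher level is more abstract and more general.

    hierarchy = [
        {"level": 0, "name": "raw telemetry stream",
         "predicts": None},
        {"level": 1, "name": "per-user response time model",
         "predicts": "average response time for one task at one time and place"},
        {"level": 2, "name": "end-to-end order latency model",
         "predicts": "total latency of the order management process"},
        {"level": 3, "name": "IT business strategy",
         "predicts": "customers complete 90 percent of initiated transactions"},
    ]

    for model in hierarchy:
        print(model["level"], model["name"])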

Performance and decision making

So, with these thoughts in mind, let us try to tackle that decision procedure. A model that has been verified in the past stands a better chance of being correct than one that has not. Thus, for any given model, at any level of the hierarchy, points are accumulated based upon past performance.

On the other hand, points are also accumulated on the basis of how high the model sits in the hierarchy. If a model is falsified but sits very high among the levels, then there is a good chance that something is wrong with the environment and the environment requires changing. In other words, the decision between changing the model and changing the world will be made based upon a threshold determined by a combination of past performance and hierarchical level.
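
One hedged way to make that threshold concrete is to score each model on its verification history and its hierarchical level, then compare the score against a cut-off; the weights and threshold below are arbitrary illustrations, not a mechanism any particular platform is known to use:

    # Toy decision rule combining past performance with hierarchy level.
    # Weights and threshold are arbitrary; a real platform would tune or learn them.

    def decide(verifications, total_tests, level, max_level, threshold=0.6):
        track_record = verifications / max(total_tests, 1)  # 0..1
        seniority = level / max(max_level, 1)               # 0..1
        score = 0.5 * track_record + 0.5 * seniority
        if score >= threshold:
            return "trust the model: change the environment"
        return "distrust the model: revise it"

    # A well-verified, high-level model survives a conflict with the data:
    print(decide(verifications=45, total_tests=50, level=3, max_level=3))
    # A low-level model with a weak track record gets revised instead:
    print(decide(verifications=3, total_tests=10, level=1, max_level=3))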

Closing thoughts

In summary, the increasing complexity and velocity of business will make enterprises rely ever more heavily on AI. This AI, however, will not be confined to observation and analysis of the enterprise environment. It will, instead, take up a central role in performing the actions that together make up an effective digital business process.

Will Cappelli, Field CTO of Moogsoft