Navigating IT’s growing “Alert Fatigue” epidemic

For modern IT teams, virtual environments have rapidly become the core of many data centers, running the company’s most important business applications. Unfortunately, this trend has outpaced the conventional approach to data center management and optimization. Virtual IT environments are more complex than physical environments, but most IT teams still use legacy "siloed" tools to manage them. That is, the virtual environments are divided into separate silos of network, application, compute and storage. In a virtual environment where virtual components share physical resources across silos, this strategy leaves IT with a fragmented, incomplete view of what's actually causing performance problems.

Compounding this issue is the fact that each silo uses a different collection of monitoring and diagnostic tools, making it even more difficult to find the root cause of application performance issues. According to a recent survey of IT pros, only 22 percent use a single tool, whereas 68 percent need two to four tools to find the root cause of application performance issues. With this approach, IT has to sift through thousands of alerts, compare analyses from multiple tools and draw on their own knowledge and experience to optimize and solve problems. Navigating all of these alerts leads to “alert fatigue”: with so many meaningless alerts, it becomes increasingly difficult to pinpoint which are worth diagnosing and which could have the most detrimental impact on application performance.

As a result, IT wastes valuable time sifting through alerts, often solving the same problems repeatedly. Many IT teams have inadvertently become reactive firefighting organizations that don’t have time to implement the forward-thinking IT programs and systems that add value to their business. Worse still, IT has no unified way to predict potential problems, forecast performance or capacity requirements, or develop strategies to prevent problems from developing in the first place.

Separating issues from alerts 

Today, virtual environments are simply too complex and dynamic for humans to manage with siloed approaches. Traditional tools attempt to mitigate alert storms by presenting dashboards with multiple graphs and metrics, but those dashboards do not give clear answers to key questions. IT has to do significant manual work to assemble, compare and analyze all of the relevant data needed to uncover the root cause of application performance issues. IT also has to rely heavily on their specialized knowledge and expertise to draw conclusions and create a plan for resolution.

Traditional tools are limited by their threshold-based design. They require IT to set individual thresholds for each metric they want to measure – CPU utilization, memory utilization, network latency, etc. In a single environment, IT may need to set, monitor and continually tune thousands of individual thresholds. Every time the environment changes, such as when a workload is moved or a new VM is created, the thresholds have to be readjusted. When a threshold is exceeded, these tools often fire off thousands of alerts for the same root cause, burying important information in alert storms with no recommended resolution. Because threshold-based tools only look at individual metrics in isolation, they cannot account for issues caused by interactions between related virtual components, such as “noisy neighbor” scenarios. Worse, they don’t analyze data in real time. Instead, they update periodically and present an average of the data collected in the interval. Short, periodic spikes in CPU utilization or another metric that indicate an impending application performance issue may never surface.
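To make the averaging problem concrete, here is a minimal sketch in Python. It is not taken from any particular monitoring product: the 90 percent threshold, the 60-second reporting interval and the sample readings are all hypothetical values chosen for illustration. It shows how a three-second CPU spike that clearly breaches the threshold in the raw data disappears once the readings are averaged over the interval.

```python
from statistics import mean

THRESHOLD = 90.0  # hypothetical alert threshold, in percent CPU


def interval_averages(samples, interval):
    """Average consecutive, non-overlapping groups of `interval` samples."""
    return [mean(samples[i:i + interval])
            for i in range(0, len(samples), interval)]


def breaches(values, threshold=THRESHOLD):
    """Return the indices of values that exceed the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


# Hypothetical per-second readings: steady 40% CPU with a 3-second spike to 100%.
cpu = [40.0] * 60
cpu[20:23] = [100.0, 100.0, 100.0]

raw_breaches = breaches(cpu)                    # checked against every sample
averaged = interval_averages(cpu, interval=60)  # what a periodic tool reports
averaged_breaches = breaches(averaged)          # checked against the average

print(raw_breaches)       # [20, 21, 22]
print(averaged)           # [43.0]
print(averaged_breaches)  # []
```

The spike is visible in the per-second data, but the single averaged value of 43 percent sits far below the threshold, so a tool that only evaluates the averaged figure would raise no alert.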

IT teams using threshold-based tools have to review dashboards filled with charts, graphs and data summaries. While some of these tools may present a map of the infrastructure that shows the hierarchy of physical relationships between components, none give IT the insight they need to understand how the components interact and impact one another. IT is forced to draw their own conclusions about how to solve problems and optimize virtual environments. As a result, IT teams may spend days manually assembling events and alerts across all silos, trying and testing multiple solutions to solve and troubleshoot issues. According to a recent survey, 44 percent of IT pros typically need more than three hours to resolve an application performance issue and only 18 percent reported that the strategies they implement to resolve application performance issues are completely accurate. This troubleshooting process creates a huge drain on IT time and resources – and can negatively impact morale.   

How machine learning-based tools can help 

Machine learning-based analytics solutions for virtual IT environments vastly improve on threshold-based tools by focusing on knowledge discovery, rather than merely reporting data or metrics. Machine learning is a highly automated and adaptive technology that “learns” about the infrastructure and the interrelated behavior of its various components over time. Predictive analytics enable these tools to identify issues before they arise and recommend specific steps for eliminating them.

These solutions analyze data from a wide variety of sources, across the silos, and learn the complex patterns of behavior between interrelated objects over time. Tools that use advanced machine learning and deep learning technology instantaneously identify the root cause of performance issues and provide recommendations for solving them with a level of precision and accuracy that humans alone cannot provide. IT gets the accurate, specific information they need without alert storms or manual intervention. Instead of presenting IT with a variety of dashboards, these tools provide the answers IT is trying to derive from the dashboards, along with recommendations for solving problems, eliminating future issues and gaining a holistic view of their infrastructure.

Understanding how IT resources impact one another 

The growth that we are seeing in the size and complexity of virtual data centers has pushed IT departments past the limits of traditional, manual approaches. To get their arms around managing these data centers, more and more companies must turn towards automated data science approaches.   

The use of machine learning and deep learning technologies to understand and manage virtual environments will become the norm. Automation of data center operations will be driven by machine learning to allow dynamic response to changing requirements. These advanced deep learning and machine learning analytics tools learn the patterns of behavior between interdependent components over time. As a result, they can automatically and accurately identify behaviors between components that may indicate subtle problems that threshold-based tools cannot detect. More importantly, they automatically recommend the specific steps to resolve problems.
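As a rough illustration of what learning a pattern of behavior between components can mean, the Python sketch below fits a simple straight-line relationship between two metrics and flags readings that depart from it. This is a deliberately simplified stand-in, not the method used by any specific product, which would rely on far richer models. The metric names, the history data, the 15 ms static limit and the factor of three are all assumptions made for the example.

```python
from statistics import mean, pstdev

LATENCY_LIMIT_MS = 15.0  # hypothetical static threshold on latency alone


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def learn_baseline(xs, ys):
    """Learn the usual relationship and how far readings normally stray from it."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return slope, intercept, pstdev(residuals)


def is_anomalous(x, y, baseline, k=3.0):
    """Flag a reading whose deviation from the learned line exceeds k spreads."""
    slope, intercept, spread = baseline
    return abs(y - (slope * x + intercept)) > k * spread


# Hypothetical history: a VM's I/O rate (IOPS) and its datastore latency (ms).
iops = [100.0 * n for n in range(1, 11)]
noise = [0.1 if n % 2 == 0 else -0.1 for n in range(10)]
latency = [2.0 + 0.01 * x + e for x, e in zip(iops, noise)]

baseline = learn_baseline(iops, latency)

# 9 ms is under the static limit, but far too high for only 200 IOPS.
suspicious = is_anomalous(200.0, 9.0, baseline)
# About 11 ms at 900 IOPS matches the learned relationship.
normal = is_anomalous(900.0, 11.05, baseline)

print(suspicious, normal)
```

In this sketch the first reading would never trip the static latency threshold, yet it is flagged because it is inconsistent with the load the VM is generating, which is the kind of cross-component signal, such as a noisy neighbor, that a per-metric threshold cannot express.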

By optimizing virtual data centers using machine learning-based analytics platforms, IT departments can turn their focus to adding value to their core business operations and end user productivity. They can add workloads, new applications and new technologies with a clear understanding of the impact of the change on their environment. Not only that, by understanding how IT resources impact one another, IT can better manage existing resources, reduce costs and improve efficiency across the entire organization.   

Jim Shocrylas, Director of Product Management, SIOS 

Image Credit: SFIO CRACHO / Shutterstock