You don’t need me to tell you that good monitoring and alerting is crucial to data centre efficiency. An effective monitoring and alert combination can ensure that IT professionals enjoy well-informed interactions with the technologies they manage—ultimately making their day much smoother.
However, good monitoring doesn’t come down to simply having the latest technology at your disposal, nor is it guaranteed by the cleverest technique. Monitoring, like many other things in life, is most influenced by mindset, because every aspect of monitoring comes down to choice. We choose when, how, and where to gather data from our environments, and we then have to choose what to do with it once we have it.
With the right mindset, monitoring can transform your data centre for the better, and, conversely, implementing a monitoring solution with the wrong mindset can disrupt your systems and frequently ruin your day (not to mention your sleep habits).
You can probably tell by now that good data centre monitoring and alerting is a particular passion of mine—I even wrote a (free) eBook on it (yes, I did just do that). So, when I see examples of bad monitors and alerts arise, I am naturally inclined to demonise them all in a well-intentioned blog post.
Alerting based on high CPU utilisation

Do you know what’s wrong with an alert that triggers when CPU utilisation exceeds 90%? Everything. High CPU is perhaps the most commonly used monitoring and alert combination, but it sheds no light on what’s going wrong, or even whether anything is wrong in the first place.
In fact, high CPU is often simply proof that the system is keeping up with demand and is correctly sized for the workload. Believe it or not, though, this alert can be amended to become useful.
The key is first to collect three pieces of data:
- The CPU utilisation (CPU_UTIL).
- The number of CPUs (or cores) in the system (CPU_COUNT).
- The number of jobs waiting to be processed (CPU_QUEUE).
With these gathered, the alert should be triggered when (CPU_QUEUE > CPU_COUNT) AND (CPU_UTIL > x%) for more than y minutes.
If the alert is triggered, you can be sure that your machine is failing to meet the demand of the workload. In that case, either one of your processes is faulty or your hardware is no longer sized for the load.
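As a minimal sketch of that rule in Python: the sample shape, thresholds, and sustained-window length below are all illustrative, and real values depend on your polling interval and platform.

```python
def cpu_alert(samples, cpu_count, util_threshold=90.0, sustained_samples=5):
    """Fire only when the run queue outgrows the core count AND
    utilisation stays above the threshold for the last N samples.

    samples: list of (cpu_util_percent, cpu_queue_depth) tuples,
    one per polling interval. Names and thresholds here are
    illustrative, not tied to any particular monitoring product.
    """
    if len(samples) < sustained_samples:
        return False
    return all(util > util_threshold and queue > cpu_count
               for util, queue in samples[-sustained_samples:])
```

A box pegged at 95% CPU with an empty run queue stays quiet; only sustained high utilisation combined with a backlog of waiting jobs fires the alert.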
Alerting based on bandwidth utilisation

Solely monitoring bandwidth will, again, tell you nothing about what’s going wrong, or even whether anything is wrong in the first place (there’s a developing trend here). And, at the risk of being repetitive, high bandwidth utilisation without any other negative indicators is generally proof that you have the right amount of bandwidth for your needs.
To make bandwidth alerting actually useful, you simply need to add in response time. This is because if your bandwidth is above a certain percentage and your response time through the interface is high, you have a bottleneck in the system.
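That combination reduces to a simple predicate. The Python sketch below assumes illustrative thresholds (80% utilisation, 200 ms response time) that you would tune per interface:

```python
def bandwidth_alert(util_percent, response_ms,
                    util_threshold=80.0, response_threshold_ms=200.0):
    """Alert only when high link utilisation coincides with a slow
    response time through the interface, i.e. a likely bottleneck.
    Both thresholds are illustrative placeholders, not vendor defaults."""
    return util_percent > util_threshold and response_ms > response_threshold_ms
```

Note that a busy-but-responsive link stays quiet (it is simply right-sized), and a slow-but-idle link also stays quiet, which is exactly the case NetFlow helps you investigate.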
It is also worth noting that NetFlow is a great tool for discovering how your bandwidth is being used. The process above wouldn’t fire when bandwidth utilisation is at a normal level but response time is slow, yet there could still be an underlying issue. NetFlow lets you determine whether the combination of a slow response time and the specific uses of the bandwidth is worth worrying about.
Monitoring CPU utilisation in virtual environments
Monitoring a virtual machine for CPU utilisation is problematic, to say the least. The virtual machine may report that CPU utilisation is high when, in reality, the physical resource is nowhere near capacity. Equally, the virtual machine may report that CPU utilisation is low while another VM on the same host (a “noisy neighbour”) is starving it of resources. You get my drift?
To correct this alert, you will first need to determine CPU Ready Time, or RDY percentage. This is the condition in which the virtual machine has work to do, but must wait for the hypervisor to schedule that work on one or more of the physical CPUs. This typically occurs when a physical host is oversubscribed with too many virtual machines.
The second piece of data to determine is Co-Stop. This is the amount of time an SMP virtual machine was ready to run but was delayed due to a co-virtual CPU (vCPU) scheduling issue. In a multi-vCPU virtual machine, Co-Stop indicates either a) the additional time after the first vCPU becomes available until the other vCPUs are ready for the job that needs processing, or b) any time a vCPU is stopped because of scheduling issues.
With both pieces of data identified, a useful alert would trigger when CPU Ready Time > (10% × vCPU count) or Co-Stop > 3% for an extended period of time.
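Here is that rule of thumb as a Python sketch. The 10%-per-vCPU and 3% figures come straight from the formula above; evaluating the condition over an extended window is left to your monitoring tool.

```python
def vm_cpu_pressure(ready_percent, costop_percent, vcpu_count):
    """True when the hypervisor, not the guest, is the likely
    constraint: CPU Ready above roughly 10% per vCPU, or Co-Stop
    above 3%. Apply this over a sustained window rather than a
    single sample to avoid flapping alerts."""
    return ready_percent > 10.0 * vcpu_count or costop_percent > 3.0
```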
Alerting based on “Top 10 Queries by CPU”
Finally, this list would not be complete without “Top 10 Queries by CPU,” an alert that, like the preceding three, is almost entirely useless. But the unfortunate thing is that this alerting method is also very commonly used.
In order to find what you are looking for, forget “top queries” and shift your focus to the queries that encounter the largest amount of wait in a given period of time. With these determined, it will be much easier to identify and resolve any database performance issues.
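As a sketch of that shift in focus, ranking queries by accumulated wait time instead of CPU might look like this in Python. The record shape is hypothetical; the real fields depend on your DBMS’s statistics views.

```python
def top_queries_by_wait(query_stats, n=10):
    """Rank queries by total time spent waiting during the interval,
    not by CPU consumed. Each record is a dict with 'query' and
    'total_wait_ms' keys: a made-up shape for illustration only."""
    return sorted(query_stats,
                  key=lambda q: q["total_wait_ms"],
                  reverse=True)[:n]
```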
You (don’t) have to read this bit
The examples of bad monitoring and alerting illustrated here span varying areas of IT. My point is that monitoring with the wrong mindset can negatively impact any and every aspect of your environment.
It goes without saying that the right mindset when implementing monitors and alerts can significantly benefit all aspects of your environment. With any luck, at least one of my examples will provide you with the inspiration to take another look at your monitoring approach and assess whether it can be tweaked for the better.
Leon Adato, Head Geek and Technical Product Marketing Manager, SolarWinds
Image source: Shutterstock/alexskopje