
Database monitoring: get it right or risk your costs rising

(Image credit: Geralt / Pixabay)

With almost every business today having to manage data in some shape or form, databases are becoming the backbone of modern society. Keeping them in top condition is crucial to ensure that systems run smoothly, data is stored and retrieved easily, and – most importantly – there is no downtime. Many businesses are starting to rely on database performance monitoring tools, or alternatives, to ensure their databases function as expected. There are many tools and methods for gaining visibility over data estates, and most appear to be comprehensive and authoritative. But ultimately, are they giving businesses genuine assurance that everything is running as it should, or are they putting on an act?

To understand how your method of monitoring is functioning, ask yourself the following questions:

  • Is it contributing to more issues than it’s solving?
  • Is it providing enough detail, and the right details, to help you quickly resolve and prevent issues?
  • When something fails, is it supported by experienced engineers?
  • Can it scale with your data estate’s expected growth?
  • Is it operating in every environment you need it to be?

Any issues with your database could lead to downtime, or at the very least slowtime, neither of which is desirable for a business. According to Gartner, the average cost of IT downtime is $5,600 per minute. Though this varies greatly with the size and nature of the company, even small companies can see costs approaching $100,000 per hour of downtime, with larger companies seeing costs in excess of $1 million per hour. Slowtime, while estimated at one fifth the cost per hour of downtime, tends to occur ten times as often.
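To put those figures side by side, here is a rough back-of-the-envelope sketch in Python based on the numbers above; the four hours of annual downtime is an illustrative assumption, not a benchmark.

```python
# Back-of-the-envelope comparison of downtime vs slowtime cost, using the
# figures cited above. The incident volumes are illustrative assumptions.

DOWNTIME_COST_PER_MIN = 5_600                          # Gartner average, USD/minute
DOWNTIME_COST_PER_HOUR = DOWNTIME_COST_PER_MIN * 60    # roughly 336,000 USD/hour

SLOWTIME_COST_PER_HOUR = DOWNTIME_COST_PER_HOUR / 5    # one fifth the hourly cost...
SLOWTIME_FREQUENCY_MULTIPLIER = 10                     # ...but ten times as frequent

downtime_hours = 4                                     # hypothetical year of outages
slowtime_hours = downtime_hours * SLOWTIME_FREQUENCY_MULTIPLIER

downtime_total = downtime_hours * DOWNTIME_COST_PER_HOUR
slowtime_total = slowtime_hours * SLOWTIME_COST_PER_HOUR

print(f"Downtime: {downtime_hours:>3} h -> ${downtime_total:,.0f}")
print(f"Slowtime: {slowtime_hours:>3} h -> ${slowtime_total:,.0f}")
```

On those assumptions, slowtime quietly costs roughly twice as much over a year as the outages that get all the attention.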

These costs can add up very quickly, so being quick to identify an incident, determine its root cause, resolve it, and ultimately avoid it has tremendous value to any company. This is where monitoring tools can come in handy, but businesses do not always use them effectively. Without meaningful processes in place, businesses risk paying twice: once for the solution they think is working, and again for the downtime and slowtime that continue to occur. So what options are most common, and what do IT leaders need to watch out for?

1. Application Performance Monitoring (APM) tools

Today there is a wealth of tools on the market that provide good quality, simple application performance monitoring and give visibility into the general health of IT environments. However, studies of these tools reach similar conclusions about their effectiveness. The majority of respondents say the most common source of application performance issues is the data platform. As for the effectiveness of APM tools themselves, most respondents say they point you in the right direction but rarely identify the root cause of those data platform issues. Ultimately, additional manual gathering of data through other means is necessary to troubleshoot and resolve the issue.

This leads to incomplete solutions and a much longer time to root cause, as well as more difficult long-term optimisation to mitigate similar issues in the future. APM tools are good at providing a broad view across a network, but they rarely give the depth of visibility into the data platform that is typically needed to get to the root of performance issues.

2. Custom scripts

Database administrators (DBAs) who have worked in the industry long enough will likely have a collection of custom scripts that they've either discovered online or created themselves. These scripts are often used to augment another tool such as APM, or on their own as specific needs arise.

There are several limitations to relying on a library of scripts for long-term use. It's rare that they provide a complete picture of an IT environment – these scripts are often built for a specific purpose, and once that challenge has been overcome they may rarely provide value again. Those that do provide longer-term value often become difficult to maintain as environments grow and evolve and technology changes. Maintaining them can become a full-time job in itself, and because they rarely deliver the granularity or historical detail needed to find root cause and prevent future occurrences, they end up consuming far more time than they save.
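To make the pattern concrete, here is a minimal sketch of the kind of ad-hoc script described above: it polls SQL Server's sys.dm_exec_requests view and appends point-in-time snapshots to a CSV file. The connection string, polling interval, and output file are illustrative assumptions, not a recommended design.

```python
# A minimal example of the kind of ad-hoc monitoring script a DBA might keep
# around: poll currently running requests and append snapshots to a CSV file.
# Connection string, interval, and output path are illustrative assumptions.
import csv
import time
from datetime import datetime, timezone

import pyodbc  # pip install pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=master;Trusted_Connection=yes;"
)
POLL_SECONDS = 60

QUERY = """
SELECT session_id, status, command, cpu_time, total_elapsed_time,
       wait_type, blocking_session_id
FROM sys.dm_exec_requests
WHERE session_id > 50;  -- skip most system sessions
"""

conn = pyodbc.connect(CONN_STR)
with open("requests_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        snapshot_time = datetime.now(timezone.utc).isoformat()
        for row in conn.cursor().execute(QUERY):
            writer.writerow([snapshot_time, *row])
        f.flush()
        time.sleep(POLL_SECONDS)
```

Even this simple example shows the trade-off: it answers one narrow question, captures only point-in-time snapshots, and every new question tends to mean yet another script to write and maintain.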

3. Wait stats

A ‘resource wait’ is accumulated by processes running on SQL Server which are waiting for a specific resource to become available. Wait stats therefore highlight where the most pressing bottlenecks are building within SQL Server. Some IT leaders can be tempted to simply focus on wait stats to understand the performance of their databases, with the attitude that “if I know what waits are associated with my queries, I know exactly what’s wrong and I don’t need to worry about all the other information available.”

Wait stats at the server level are a great place to start to get a feel for the performance profile of your server and where issues may be occurring. However, it's just that: a great start. Like the other methods mentioned above, you ultimately find yourself needing more information from other sources to get the complete picture and determine root cause. What's worse, focusing on wait stats – especially at the query level – can lead you to the wrong conclusions altogether. It's like focusing on one car in a traffic jam: the car is running perfectly and should be moving, but its wait stats don't reveal that a truck up ahead needs to turn around. The truck is moving, so it reports no issues, while the waiting car has no insight into what is actually holding it up.
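As a concrete illustration of that server-level starting point, the sketch below reads the cumulative counters from sys.dm_os_wait_stats and prints the ten largest waits; the connection string and the short list of benign waits to exclude are illustrative assumptions.

```python
# Server-level wait stats: a useful starting point, not a diagnosis.
# Reads cumulative waits from sys.dm_os_wait_stats and prints the top ten
# by total wait time. The benign-wait filter is deliberately short and
# illustrative; production scripts typically exclude dozens of wait types.
import pyodbc  # pip install pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=master;Trusted_Connection=yes;"
)

BENIGN_WAITS = ("SLEEP_TASK", "LAZYWRITER_SLEEP", "XE_TIMER_EVENT",
                "CHECKPOINT_QUEUE", "BROKER_TASK_STOP")

QUERY = f"""
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ({", ".join("?" for _ in BENIGN_WAITS)})
ORDER BY wait_time_ms DESC;
"""

conn = pyodbc.connect(CONN_STR)
for wait_type, tasks, wait_ms, signal_ms in conn.cursor().execute(QUERY, BENIGN_WAITS):
    print(f"{wait_type:<30} tasks={tasks:>10}  wait={wait_ms:>12} ms  signal={signal_ms:>10} ms")
conn.close()
```

Note that these counters are cumulative since the last service restart (or since they were last cleared), so the useful signal is usually the delta between two samples – and even then, as the traffic-jam analogy suggests, they don't tell you which query or blocker is actually responsible.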

4. Database Performance Monitoring (DPM) tools

DPM tools are, in general, the most effective way to monitor database performance, because that is exactly what they are designed to do. However, even these can be less effective if they aren't used to their full potential.

A lack of detail can be an issue with DPM tools, particularly for counter-based metrics such as CPU and IO. Some products and home-built solutions capture snapshots of this data only once every several minutes. Often this is because the method of collection is inefficient, so sampling any more frequently would risk over-burdening the monitored server. Other limitations involve query-level details, where only the Top N queries – typically just the most recent or highest-ranked entries of a result set – are ever collected or shown, regardless of the level of activity on the server. Others rank queries by their own waits rather than by the actual resource consumption of the request, which is where you are much more likely to identify the root cause.
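To contrast ranking by waits with ranking by actual resource consumption, the sketch below pulls the top cached statements by total CPU from sys.dm_exec_query_stats; this is a hedged illustration of the idea, not how any particular DPM product collects its data, and the connection string is again an assumption.

```python
# Rank cached statements by actual resource consumption (CPU here) rather
# than by their waits. Illustrative only; the connection string is an
# assumption and this is not how any specific DPM product collects data.
import pyodbc  # pip install pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=master;Trusted_Connection=yes;"
)

QUERY = """
SELECT TOP (10)
       qs.total_worker_time / 1000 AS total_cpu_ms,  -- total_worker_time is in microseconds
       qs.total_logical_reads      AS total_reads,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset END
                   - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
"""

conn = pyodbc.connect(CONN_STR)
for cpu_ms, reads, execs, text in conn.cursor().execute(QUERY):
    print(f"{cpu_ms:>12,} ms CPU | {reads:>12,} reads | {execs:>8,} execs | {(text or '')[:80]!r}")
conn.close()
```

Even a simple resource-based ranking like this tends to surface the culprits that a waits-only view, or an infrequent counter snapshot, can miss.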

Scalability is also often a challenge. Most DPM tools are limited in how many servers they can monitor from a single product installation. The primary bottleneck is that they are all backed by a SQL Server database that stores the data they collect. Because of this, these products start to struggle somewhere around 200-300 monitored SQL Servers, and larger enterprises may need to deploy multiple installations to cover their whole estate. Some DPM products handle this by supporting multiple back-end databases from a single interface, though this adds significant cost and administrative overhead.

People are important

No matter what tools or methods a business implements, data estates can be complicated, even in smaller, simpler environments. That's why it's vital, regardless of the tool, that it's backed by responsive, expert support engineers who can ensure the tool is running optimally and that you are getting the most value from it. Especially when downtime occurs unexpectedly, business leaders want to know they have an expert available who can help get them back on track.

The bottom line is simple – downtime and slowtime in the data layer can both cost an organisation huge amounts of money if they're not resolved quickly. But simply having one tool or method in place and assuming it will work wonders without any effort is ineffective and can cost more money in the long term. A process that is set up optimally and used correctly can mean the difference between a business losing a fortune during its next outage, and mitigating those costs or preventing critical and high-priority events from occurring in the first place.

Steven Wright, Director of Sales Enablement, SentryOne