The handy guide to data centre monitoring alert automation that you (potentially) didn’t know you needed

I’m going to let you in on a secret: there is no secret. Let me explain; there is no “secret IT pro handshake” or “secret IT society” that keeps information under lock and key from mere mortals. Those of us who have worked in the IT business for years (or even decades) are not only willing to share—we’re usually excited and eager to share our hard-won knowledge with colleagues from other parts of the business, should they express the slightest inclination to learn. Nowhere, I think, is that more true than in the IT sub-discipline of monitoring and automation. When you say “monitoring,” most people think “tickets,” followed closely by “annoying and frequently useless tickets that interrupt my day.”

So, you want to know a not-so-secret secret? In my best Morpheus baritone voice, I would say, “What if I told you monitoring doesn’t have to create a ticket?” Monitoring can lead to all sorts of interesting outputs, none of which are more of a game-changer than automation. 

Automation, when applied to alerts, can be a key time-saver for IT professionals, and has the potential to significantly increase the output and overall effectiveness of your data centre. The point of automation, as it applies to monitoring, is that the system that detects a problem is in the best position to respond to that problem automatically, in near real-time. At best, it resolves the issue. At worst, it gathers additional information about the problem at the time it occurred, so that the person who responds can start troubleshooting effectively.
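To make that idea concrete, here is a minimal sketch in Python of an alert response that tries a fix first and falls back to gathering diagnostics. The names respond_to_alert, remediate and collect_diagnostics are my own placeholders, not part of any monitoring product; your tool will have its own hooks for these steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AlertOutcome:
    resolved: bool
    notes: List[str] = field(default_factory=list)


def respond_to_alert(
    remediate: Callable[[], bool],
    collect_diagnostics: Callable[[], List[str]],
) -> AlertOutcome:
    """Try the automated fix; if it fails, gather context for a human.

    Both callables are hypothetical hooks supplied by the caller.
    """
    try:
        if remediate():
            return AlertOutcome(resolved=True, notes=["remediation succeeded"])
        notes = ["remediation did not clear the problem"]
    except Exception as exc:  # a failed fix must not hide the original alert
        notes = [f"remediation raised: {exc}"]
    # Best case missed, so capture what the system looked like right now.
    notes.extend(collect_diagnostics())
    return AlertOutcome(resolved=False, notes=notes)
```

The notes from an unresolved outcome are what you would attach to the ticket, so whoever picks it up starts with evidence rather than a bare alert.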

Read on for handpicked alert response examples from my “garden of automation.” These alert responses have blossomed in their environments and could become a life hack for you, too. 

Restart an IIS™ application pool

Restarting application pools is often the easiest and best fix for website-related issues. Note my deliberate use of “best” rather than “quickest”: a restart is frequently the best fix, not just the fastest one. However, automatically restarting the application pool becomes slightly more complicated when you consider that one server could be running multiple websites, which in turn have multiple application pools, or vice versa. Unfortunately, you will have no way of determining how the server and websites were first configured. Nonetheless, if you can access the application pool name, you can do the following:

  • Use the built-in restart application pool option in your monitoring solution.
  • Run this command from the command line of the affected server: appcmd [stop/start] apppool /apppool.name:<app pool name goes here>. Keep in mind that appcmd.exe may not be in the path. You can typically find it in C:\Windows\System32\inetsrv\appcmd.exe. Also note that appcmd.exe can’t run against a remote system, so you will have to use another utility, such as psexec, to run it from the monitoring server against a remote machine.
  • Run a PowerShell® script locally or remotely. The code would look like the below:

# Load IIS module:

Import-Module WebAdministration

# Set a name of the site we want to recycle the pool for:

$site = "Default Web Site"

# Get pool name by the site name:

$pool = (Get-Item "IIS:\Sites\$site" | Select-Object applicationPool).applicationPool

# Recycle the application pool:

Restart-WebAppPool $pool
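If your monitoring server scripts its responses in Python, the appcmd option above can be sketched as a small command builder. This is an illustration under assumptions: app_pool_commands is a name I made up, the appcmd.exe path is the usual default and may differ on your servers, and psexec must already be installed and on the path.

```python
from typing import List, Optional

# Usual appcmd.exe location on a standard IIS install (assumption: adjust if yours differs).
APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"


def app_pool_commands(pool: str, remote_host: Optional[str] = None) -> List[List[str]]:
    """Return the stop and start command lines for one application pool."""
    commands = []
    for action in ("stop", "start"):
        cmd = [APPCMD, action, "apppool", f"/apppool.name:{pool}"]
        if remote_host:
            # appcmd.exe only works locally, so wrap it in psexec for remote hosts.
            cmd = ["psexec", rf"\\{remote_host}"] + cmd
        commands.append(cmd)
    return commands
```

Each returned list can be handed to subprocess.run on the monitoring server. Building the commands separately from running them makes it easy to log exactly what the automation was about to do.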

Disk-full

Most people consider disk-full alerting a relatively simple concept, but it has several separate components. Work through them with the steps below:

  • Identify the most suitable alert for your requirements. You may find that the best method is to include logic in your alert that tests the total remaining space on the drive as well as the percentage free. Regardless of the alert that you choose, the most important factor is to ensure that you are monitoring disk space in a way that is relative to the size of the volume: 5 percent free means something very different on a 100 GB volume than on a 10 TB one.
  • Clear unnecessary files out of the relevant directories. Be aware that in order to do so, you may be required to impersonate a privileged user account, as many monitoring solutions run on the server as the system account. There are many ways this can be done, all largely dependent on your individual environment, so I’ll leave you to figure out the rest.
  • Make sure that the correct directories for the specific server are being targeted. I have found that the best approach for this is to place a script file in a common shared folder that maps to all servers. The script can first be set up to identify the proper directories, and then proceed to clear them out (with all the necessary precautions in place, of course).
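The alert logic and the clean-up step above can be sketched in Python as follows. Treat it as a starting point only: the function names, the 10 percent and 5 GiB thresholds, and the age-based clean-up rule are all assumptions of mine, and the dry-run default is one example of the “necessary precautions.”

```python
import shutil
import time
from pathlib import Path
from typing import List


def disk_is_low(total_bytes: int, free_bytes: int,
                min_free_pct: float = 10.0,
                min_free_bytes: int = 5 * 1024**3) -> bool:
    """Alert only when free space is low both as a percentage and in bytes."""
    free_pct = 100.0 * free_bytes / total_bytes
    return free_pct < min_free_pct and free_bytes < min_free_bytes


def volume_is_low(path: str) -> bool:
    """Apply the same test to a real volume, for example 'C:\\' or '/var'."""
    usage = shutil.disk_usage(path)
    return disk_is_low(usage.total, usage.free)


def clear_old_files(directory: str, max_age_days: float,
                    dry_run: bool = True) -> List[Path]:
    """Return (and, unless dry_run, delete) files older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    stale = [p for p in Path(directory).iterdir()
             if p.is_file() and p.stat().st_mtime < cutoff]
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Requiring both conditions keeps a multi-terabyte volume with hundreds of gigabytes free from alerting just because it dipped under 10 percent. Run the clean-up with dry_run left on first and review the list it returns before letting it delete anything.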

Restart IIS

Make no mistake: resetting IIS is a hardcore website fix. However drastic this option may be, it is at times necessary. There are many ways in which you can restart the IIS web server:

  • Use the restart IIS option in your monitoring solution
  • Execute iisreset /restart at the local command line of the affected system
  • Remotely execute iisreset <computername> /restart
  • Create and execute a PowerShell command, such as invoke-command -scriptblock {iisreset}
  • Or, more simply, use the call operator & {iisreset}

Restart a server

If restarting IIS is considered hardcore, then restarting the entire server is truly a backs-to-the-wall, no-holds-barred course of action. Nevertheless, there will again be times when this is your most suitable option. You can do this in many ways:

  • Use the restart server action that is built into most monitoring solutions
  • For Linux®, issue the command ssh -l <username> <computername> 'shutdown -r now'
  • In Windows®, you can remotely restart a machine by issuing the command shutdown /r /f /t 0 /m \\<machinename> /c "<comment to add to eventlog>"
  • Using PowerShell, you can do it with restart-computer <computername>
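Since the fixes so far run from mild (recycle an application pool) to drastic (restart the server), one way to combine them is an escalation ladder that stops at the first step that works. The Python sketch below is my own illustration rather than a feature of any particular monitoring tool; escalate, the step callables and is_healthy are hypothetical names you would wire up to your own commands and checks.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus a hypothetical callable that performs the fix.
Step = Tuple[str, Callable[[], None]]


def escalate(steps: List[Step], is_healthy: Callable[[], bool]) -> Optional[str]:
    """Run fixes from least to most drastic; return the name of the one that worked."""
    for name, action in steps:
        action()
        if is_healthy():
            return name
    return None  # nothing helped: hand over to a human
```

You would pass the steps in order of severity, for example recycle the pool, then restart IIS, then restart the server, so the server restart only happens when the gentler options have already failed.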

Restart a service

As a data centre professional, you may find that a service deciding it no longer wants to carry on working presents you with some slight issues. When that happens, restarting the service is an option. Sometimes this won’t do a great deal, but then again, sometimes it will. It would of course be absurd to ask a computer to perform in a logical manner. The following may help:

  • Use the restart service action that is built into most monitoring solutions
  • Issue the command net start <servicename> on the local computer
  • Issue the command sc \\<computer> start <servicename> from a remote machine
  • Run a PowerShell script with the following commands: (get-service -ComputerName <computername> -Name <servicename>).Start()
  • For Linux systems, run the pkill command, either locally (pkill -9 <process name>) or remotely (ssh -l <username> <computername> 'pkill -9 <process name>'). Note that pkill only terminates the process, so something must start it again afterwards.
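Because a single start attempt sometimes “won’t do a great deal,” a response script can retry a few times before giving up. Here is a hedged Python sketch built around the sc command from the list above; start_service, its parameters and the retry counts are my own choices, and an exit code of 0 from sc only tells you the start request was accepted, not that the service stayed up.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd).returncode


def start_service(service: str, host: Optional[str] = None,
                  attempts: int = 3, delay_seconds: float = 5.0,
                  runner: Callable[[Sequence[str]], int] = _run,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Ask Windows to start a service, retrying a few times if sc reports failure."""
    # sc expects the remote computer name prefixed with two backslashes.
    cmd = ["sc"] + ([rf"\\{host}"] if host else []) + ["start", service]
    for attempt in range(attempts):
        if runner(cmd) == 0:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

The runner and sleep parameters exist so you can dry-run the logic without touching a real server. If the function returns False, that is the moment to raise the ticket.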

Back up a network-device configuration

I have shown automation to be a direct remedy in my above-mentioned examples; it can also be used to gather valuable forensic information that can help troubleshoot the issue. Network-device configurations are a good example of this approach. This method will not fix the issue, but it will give you the ability to pull a device configuration based on an event trigger. If necessary, it may also give you the option to return to the last-known-good configuration. You can use the following approach to do so:

  • Copy the config with built-in functions in your monitoring solution
  • Copy the config with PowerShell:

New-SshSession <device_IP> -Username <username> -Password "<password>"

Invoke-SshCommand -InvokeOnAll -Command "show run" | Out-File "<filepath and filename>"

Remove-SshSession -RemoveAll
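Once the configuration is captured, you still need to store it sensibly and compare it with the last-known-good copy. The Python sketch below shows one way to do that; backup_config, config_diff and the file-naming scheme are assumptions of mine, and fetch_config stands in for whatever actually retrieves the “show run” output (your monitoring tool or an SSH library).

```python
import difflib
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, List, Optional


def backup_config(device: str, fetch_config: Callable[[str], str],
                  backup_dir: str, now: Optional[datetime] = None) -> Path:
    """Save the device's running config to a timestamped file and return its path."""
    now = now or datetime.now(timezone.utc)
    target = Path(backup_dir) / f"{device}_{now.strftime('%Y%m%dT%H%M%S')}.cfg"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(fetch_config(device))  # fetch_config is a hypothetical hook
    return target


def config_diff(last_known_good: Path, current: Path) -> List[str]:
    """Unified diff between two saved configs; empty when nothing changed."""
    return list(difflib.unified_diff(
        last_known_good.read_text().splitlines(),
        current.read_text().splitlines(),
        fromfile=last_known_good.name, tofile=current.name, lineterm=""))
```

Attaching the diff to the alert shows the responder at a glance what changed on the device since the last good copy was taken.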

Good automation is enabled by, and is a result of, good monitoring. When done correctly, it is simple. For more cuttings from my “garden of automation,” check out my eBook, "Automation, Not Art."

Leon Adato, Head Geek™, SolarWinds