Fat fingers and monkeys: How to plan for and recover from IT emergencies

Bugs in the software, mistakes in configuration files, even the infamous “fat finger” – all are responsible for outages at cloud-based services, data centers, enterprise networks, and any other IT installation, large or small.

There are just too many things that can go wrong. Even if an organisation takes care of everything humanly possible, there are still the monkeys. That's a lesson the folks at KenGen, Kenya's electric company, have learned the hard way. 

After an investigation of a nationwide blackout on June 7th that knocked power out for several hours, the company announced that “a monkey climbed on the roof of Gitaru Power Station and dropped onto a transformer tripping it. This caused other machines at the power station to trip on overload resulting in a loss of more than 180MW from this plant which triggered a national power blackout.”

Monkey-based damage is rare, of course, but other forms of damage aren't. According to backup and disaster recovery firm Quorum, for example, three quarters of downtime is caused by hardware failures, software failures, and human error. We may not be able to easily prevent monkey errors, but surely there is something companies can do to remediate, recover from, and perhaps even prevent outages caused by human error, as well as those caused by bugs, software failures, and even “unknown” causes.

Natural disasters like hurricanes and earthquakes may get all the PR when networks, banking sites, cloud-based services and more become unavailable. But according to numerous studies, the real culprits behind service outages are much more mundane, and much more controllable. Quorum's figure makes the point: hardware failures, software failures, and human error account for three quarters of downtime. It's that last category that many in the IT world are most concerned with.

Are there that many “incompetent” IT workers running networks? Are typing errors in configuration files, failure to patch, and even the “fat finger” – some of the hallmarks of what is blithely termed “human error” – that prevalent? And if so, what can companies do about it? Before the barrage of angry e-mails and threats of lawsuits begins pouring in, IT managers need to develop a coping strategy to handle the inevitable. Here are some ideas on how to do that.

Coordinate

Once an outage has been detected, the response team needs to be ready to act, and that means it needs a leader who can assign remediation tasks to team members. Along with that, there needs to be an awareness of the interconnectivity of systems. Team members need to be aware that any actions they take are liable, and likely, to affect many other users, in a chain reaction that has the potential to further disrupt work patterns.
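To make that chain reaction concrete, here is a minimal sketch of how a response team might list everything downstream of a failing component before touching it. The system names and the DEPENDENTS map are hypothetical and exist only for illustration; a real inventory would come from whatever configuration or asset database the organisation already keeps.

```python
from collections import deque

# Hypothetical map: each system -> the systems that directly depend on it.
DEPENDENTS = {
    "storage-array": ["database"],
    "database": ["billing-api", "reporting"],
    "billing-api": ["customer-portal"],
}


def downstream_impact(dependents, start):
    """Return every system that directly or indirectly depends on `start`."""
    affected = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            # Skip systems already recorded so cycles cannot loop forever.
            if dependent not in affected and dependent != start:
                affected.add(dependent)
                queue.append(dependent)
    return affected


affected = downstream_impact(DEPENDENTS, "storage-array")
print(sorted(affected))
```

A breadth-first walk is used here simply because it is short and handles circular dependencies safely; the point is that the team leader can see the blast radius of an action before assigning it.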

Communicate 

In the midst of an IT emergency, there is a tendency for workers to hunker down and focus on the horror unfolding before them, often ignoring the angry e-mails, service calls, and Twitter messages directed at them. Obviously, all resources have to be put into fixing things, but that doesn't mean that a company has to “go dark” in the wake of a crisis.

Just the opposite; acknowledging there is a problem and telling clients/customers about how it is being dealt with could mitigate the bad feelings generated by an outage. While clients/customers might still be angry over the outage, they may be more amenable to accepting the inevitable post-crisis apology if they see the department/firm hasn't abandoned them.

Educate

When the crisis is finally brought under control, the finger-pointing begins, and much time and energy will be spent on figuring out what happened, writing up reports about what happened, presenting the reports on what happened to executives, justifying/explaining what happened to the Board, etc.

Unfortunately, much of that is inevitable. Prepared managers, however, will turn the incident into a learning opportunity and put in place systems and processes to make such events a one-time affair.

Prevent

While it’s important to be prepared to deal with an outage, preventing it altogether is what IT leaders should strive for in the first place. Doing the same thing and expecting different results will not get you ahead; to be fully prepared, a new approach and some outside help may be required. Systems that can analyse an IT ecosystem and determine the interaction between different components – hardware, software, configuration files, upgrades, and anything else that a network includes – can save a lot of time, effort, and heartache later on.

Automated IT operation analytics systems can provide information about potential problems in advance, such as whether a specific software upgrade may rewrite configuration files that could disrupt service performance, what the implications of adding a new replication server are, and much more. The system sends out an alert as soon as a risk is detected, enabling a rollback or other mitigation if needed.
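As a rough illustration of the configuration-file case, the sketch below compares a configuration snapshot taken before an upgrade with one taken after it, and flags changes to settings that someone has marked as service-critical. It is a simplified assumption of how such a check could work, not a description of any particular product; the setting names, values, and the WATCHED list are all hypothetical.

```python
def config_drift(before, after):
    """Return {key: (old, new)} for settings added, removed, or changed."""
    drift = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift


def risky_changes(drift, watched):
    """Return the changed settings that are on the watch list, sorted."""
    return sorted(key for key in drift if key in watched)


# Hypothetical snapshots taken before and after a software upgrade.
before = {"max_connections": "500", "replication": "sync", "log_level": "info"}
after = {"max_connections": "100", "replication": "sync", "log_level": "debug"}

# Hypothetical list of settings considered critical to service performance.
WATCHED = {"max_connections", "replication"}

drift = config_drift(before, after)
alerts = risky_changes(drift, WATCHED)
for key in alerts:
    old, new = drift[key]
    print(f"ALERT: {key} changed from {old} to {new}; consider a rollback")
```

Here the upgrade has quietly lowered max_connections, which is on the watch list, so it raises an alert, while the change to log_level is recorded as drift but not escalated.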

By alerting staff to potential problems before they actually happen, automated IT operation analytics can prevent outages altogether, unless a monkey gets into the works. At the very least, such systems will eliminate “unknown” as a failure category.

Yaniv Valik, VP Product Management & Customer Success at Continuity Software 

Image Credit: alphaspirit / Shutterstock