Avoiding a Delta-like disaster

Summertime is big money for airlines. Yet this past summer, two major airlines – Delta and Southwest – suffered IT outages that cancelled thousands of flights, left hundreds of thousands of customers disgruntled and cost untold millions of dollars.

A press release by Southwest revealed an estimated loss of between $54 million and $82 million. Delta reported it lost $100 million in revenues due to the outage. Ouch.

When a large company has a major IT outage, the pundits come out of the woodwork – quick to analyse, criticise and offer notes of warning to others. It’s easy to sit back and wonder: how could they have messed up so badly? The truth is, outages can happen to any company, and some are virtually impossible to prevent. You may have planned for scenario A and scenario B, but not for both to fail simultaneously. Doing whatever you can to prevent outages is certainly important.

Equally or even more critical is creating sound disaster recovery processes, so that when an outage occurs, your company is back up and running with minimal disruption to customers. Delta and Southwest, by contrast, took many hours to restore their networks and systems to normal operations. To be clear, it is entirely possible for any company in any industry to create an environment in which the most important systems come back online almost immediately after an outage.

This is not just a strategy for the largest enterprises with the deepest pockets. Resilience technology is becoming more efficient and more affordable all the time. Companies should focus business continuity efforts on two fronts: bolstering IT infrastructure to resist an outage, and creating a disaster recovery plan that gets business-critical systems back online within minutes.

Follow these core strategies for outage prevention and recovery, and you’ll be ahead of most, if not all, of your competitors.

Instil a culture of rigour

Beliefs and perceptions, from the IT manager up to the CIO and even the CEO, will make or break the best plan on the planet. Understand where your company’s leaders sit on tolerance for disruptions. If the tolerance is extremely low, as it usually is, IT leaders will need to work hard to educate the senior leadership team on the budget and time required to create and maintain the kind of watertight environment that supports such a service level.

Business leaders should help set expectations for their IT department, to be reinforced by the CIO and their direct reports. With everyone on board, you can move forward confidently to the next steps.

Do the math on downtime costs and systems

Every business should have a good idea of how much one minute, one hour and one workday of downtime will cost in revenues and customer dissatisfaction. (Don’t forget the debilitating effect of social media ire.) This analysis helps frame what protection is needed and where: on networks, servers, storage and applications.
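To make that concrete, here is a minimal back-of-the-envelope sketch in Python. Every figure in it, including the revenue number and the customer-impact multiplier, is a hypothetical placeholder standing in for your own data.

```python
# Hypothetical figures only -- replace with your own.
annual_revenue = 500_000_000          # $500m in annual revenue (assumed)
revenue_hours_per_year = 365 * 24     # revenue earned around the clock
impact_multiplier = 1.5               # rough uplift for refunds, churn and
                                      # social-media fallout (assumed)

revenue_per_hour = annual_revenue / revenue_hours_per_year

for label, hours in [("one minute", 1 / 60), ("one hour", 1), ("one workday", 8)]:
    cost = revenue_per_hour * hours * impact_multiplier
    print(f"Estimated cost of {label} of downtime: ${cost:,.0f}")
```

Crude as it is, an estimate like this gives the protection-versus-cost discussion a shared baseline.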

Next, catalogue all business applications in terms of uptime requirements. Every company has a core set of systems that must always be up and running to keep the business viable, such as ecommerce software, the website and call centre technology at a retailer. Other applications might withstand a 12-24 hour outage with minimal impact on revenues and customers. This tiered approach makes for a reasonable and more tolerable plan for IT spending.
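One way to capture that catalogue is a simple tiered inventory, as in this sketch. The application names and recovery time objectives (RTOs) are hypothetical examples.

```python
from dataclasses import dataclass

@dataclass
class BusinessApp:
    name: str
    tier: int         # 1 = must always be up; 3 = can tolerate a day down
    rto_hours: float  # recovery time objective in hours

# Hypothetical catalogue for a retailer
catalogue = [
    BusinessApp("ecommerce platform", tier=1, rto_hours=0.1),
    BusinessApp("public website", tier=1, rto_hours=0.1),
    BusinessApp("call centre telephony", tier=1, rto_hours=0.5),
    BusinessApp("internal HR portal", tier=2, rto_hours=12),
    BusinessApp("data warehouse reporting", tier=3, rto_hours=24),
]

# Direct protection spend at the lowest-numbered tiers first.
for app in sorted(catalogue, key=lambda a: (a.tier, a.rto_hours)):
    print(f"Tier {app.tier}: {app.name} (RTO {app.rto_hours}h)")
```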

Evaluate infrastructure health

Now take those critical systems that must always be live, and assess how likely they are to break. Keep in mind that an environment with a high percentage of ageing systems can create a domino effect when one component goes down. The Southwest Airlines outage began with the failure of one Cisco router and escalated quickly from there.

For large companies with multiple revenue-driving projects on IT’s agenda, it’s hard to upgrade and replace systems and equipment which appear to be working just fine. That mindset is, of course, what leads to many of the big outages that cost millions of dollars in lost revenues, brand damage, recovery and even litigation.

In addition, many of the latest virtualisation tools and technologies require modernised IT environments. Prioritise by business risk the infrastructure areas that need help, and upgrade or fix those areas first. Again, this is where the cultural alignment outlined above comes into play, as top execs will need to understand the need and approve budgets accordingly.

Build for redundancy

With the knowledge of which business systems require immediate recovery after an outage, you can create a DR plan to match. The goal is to ensure that when disaster strikes, a working skeleton of your operations is available in minutes, if not seconds.

It’s clear that Delta and Southwest didn’t have a short enough failover window to get those workloads back up and running and keep the planes flying; either that, or they didn’t test it well enough. (More on that below.)

Consider that redundant environments can serve a purpose other than sitting idle waiting for an incident to occur, such as hosting development and test workloads. Technologies and methods to consider include virtual machine redundancy, server and storage virtualisation and SAN replication.
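As a simplified illustration of the active-passive pattern behind several of those technologies, here is a health-check failover loop using only the Python standard library. The hostnames and the promote step are hypothetical; a real deployment would lean on the virtualisation or replication layer itself.

```python
import socket
import time

PRIMARY = ("primary.example.internal", 443)   # hypothetical hosts
STANDBY = ("standby.example.internal", 443)
CHECK_INTERVAL_S = 5
FAILURES_BEFORE_FAILOVER = 3

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby():
    # Placeholder: in practice this would update DNS, a load balancer
    # or a virtual IP to point traffic at the standby environment.
    print("Failing over to standby:", STANDBY[0])

failures = 0
while True:
    if is_reachable(*PRIMARY):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            promote_standby()
            break
    time.sleep(CHECK_INTERVAL_S)
```

The key design decision is the failure threshold: too low and you fail over on a network blip, too high and you burn minutes of the recovery window you calculated earlier.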

Hosting strategy

Many companies are moving pieces of their environment to the cloud, whether that’s SaaS, IaaS or PaaS; this is also an excellent DR strategy. At a high level, this places the burden of uptime and fail-proof infrastructure on the service provider. With virtual technology and the distributed nature of cloud infrastructure, your systems have redundancy built in from the beginning.

If you don’t have the flexibility or budget to make your systems fault tolerant, move them to a more elastic cloud platform. The cloud can help companies chip away at their risk surface faster, and usually more economically, than upgrading the internal data centre or building another DR site.
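As one concrete example of how cloud platforms lower the cost of redundancy, most providers let you replicate data across geographic regions with a single API call. This sketch uses AWS’s boto3 library; the snapshot ID and regions are hypothetical placeholders.

```python
import boto3

# Hypothetical source snapshot and regions -- substitute your own.
SOURCE_REGION = "us-east-1"
DR_REGION = "eu-west-1"
SNAPSHOT_ID = "snap-0123456789abcdef0"

# The client is created in the *destination* region; copy_snapshot
# then pulls the snapshot across from the source region.
ec2 = boto3.client("ec2", region_name=DR_REGION)
response = ec2.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="Cross-region copy for disaster recovery",
)
print("DR copy started:", response["SnapshotId"])
```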

Testing and monitoring

Most large companies have dozens of systems generating data and alerts about system health. Unfortunately, too many inputs become noise, not intelligence. The primary goal is to identify single points of failure and what action is needed to fix them. IT managers should consider how to simplify the monitoring environment and prioritise based on KPIs and critical alerts.

This might mean consolidating systems and/or using outside services to handle the monitoring and management for you. Regular testing of the DR environment is also paramount. Delta and Southwest likely knew there were issues, but their DR systems couldn’t recover in time. Even when a DR environment sits idle, it should be managed with as much rigour as the live production environment.
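To illustrate prioritising on critical alerts rather than raw volume, here is a small triage sketch. The alerts and tiers are hypothetical; the point is to surface tier 1 risks, including quiet ones in the idle DR environment, above everything else.

```python
# Hypothetical alert feed -- in practice this would come from your
# monitoring systems.
alerts = [
    {"system": "ecommerce platform", "tier": 1, "severity": "critical",
     "message": "primary router unreachable"},
    {"system": "data warehouse", "tier": 3, "severity": "warning",
     "message": "nightly batch job running slow"},
    {"system": "DR site replication", "tier": 1, "severity": "warning",
     "message": "standby replica lagging 4 hours"},  # a silent DR risk
]

SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}

# Keep only alerts against business-critical (tier 1) systems,
# worst severity first; everything else is reviewed later.
actionable = sorted(
    (a for a in alerts if a["tier"] == 1),
    key=lambda a: SEVERITY_RANK[a["severity"]],
)
for a in actionable:
    print(f"[{a['severity'].upper()}] {a['system']}: {a['message']}")
```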

Create checks and balances

CXOs don’t want to think about outages, but they expect CIOs and CTOs to be on top of the risk. Regardless, it’s important for IT execs to keep the lines of communication open with their business counterparts, and the same goes for IT department managers keeping in touch with their bosses.

Executives may understand the gaps at a high level, but that’s not enough. They also need granular data to see exactly where the risks lie, how they relate to potential business losses, and what eliminating the risk involves. Execs may assume that minimum protection is good enough to return to business as usual in 24 hours or less.

Or, they’ll approve a slow cadence to upgrade infrastructure and expand the DR systems when a more aggressive approach is what’s really needed. Only a thorough DR test will prove whether what IT has in place meets corporate expectations.

Business process flexibility

IT isn’t solely responsible for business continuity. Business leaders should consider how, and whether, their core processes can adapt for faster recovery. Airlines, for instance, operate under strict scheduling policies that make it hard to reset and move on. Rather than cancelling the next hour of flights when an outage occurs and picking up the schedule from there, once the schedule starts to go off the rails, the whole house of cards collapses and a full day of flights comes off the board. Customers get caught up in this misery, stuck in airports for hours on end and missing important personal or business events.

There is no one-size-fits-all strategy for outage preparedness. Many factors come into play, including leadership and cultural characteristics, risk tolerance, customer expectations, business risk and budget flexibility.

Yet the fact remains: if you know you’ve done everything possible to prevent an outage and recover from it quickly, the likelihood of your business falling down when disaster hits becomes negligible. That should help any executive sleep a little better at night.

Karl Reeves, VP of Operations at Digital Fortress
