Recently, harried road warriors have watched airline after airline suffer major IT failures that bring operations to a screeching halt, cancel flights, strand thousands of passengers and cost millions of dollars. For their part, the airlines and their executives seem surprised that the problem continues to arise, first at one airline and then at another.
Airline operations are incredibly complex, with one airline requiring the proper operation of up to 14 IT systems before pushing a flight back from its gate. Mergers and acquisitions have left many of these IT operations dependent on legacy systems that share data. As a result, software can be blamed for many of the problems.
Why, then, are the airlines pointing to equipment problems or even facility personnel at their data centres when system outages take place? The British Airways outage in May 2017 and the Delta Air Lines outage in August 2016 are cases in point.
British Airways blamed a power surge for an IT systems failure that left more than 75,000 passengers stranded in London. Further reporting by The Guardian pinned the blame on a single technician.
Similarly, the failure of a power control module at a Delta Air Lines data centre caused hundreds of flight cancellations, inconvenienced thousands of customers, and cost the airline millions. Not surprisingly, neither airline provided much technical detail about the cause of the failures, leaving many questions unanswered in the aftermath.
Who or what is really to blame?
To date, Delta Air Lines has provided some technical details about its failure, blaming a piece of infrastructure hardware. Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services online.
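The cording problem described in those later reports is the kind of thing a routine inventory audit can catch. The following is a minimal sketch, not a description of any airline's actual tooling: the Server record, the feed labels "A" and "B", and the server names are all hypothetical, chosen only to illustrate the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    feeds: tuple  # power feed each cord is plugged into, e.g. ("A", "B")


def power_path_risks(servers):
    """Map each at-risk server name to the reason it depends on a single feed."""
    risks = {}
    for server in servers:
        if len(server.feeds) < 2:
            # Fewer than two cords: one feed failure takes the server down.
            risks[server.name] = "single-corded"
        elif len(set(server.feeds)) < 2:
            # Two or more cords, but all of them land on the same feed.
            risks[server.name] = "all cords on the same feed"
    return risks


# Hypothetical inventory, for illustration only.
inventory = [
    Server("booking-db", ("A", "B")),
    Server("crew-scheduling", ("A",)),
    Server("check-in", ("B", "B")),
]

print(power_path_risks(inventory))
```

In this made-up inventory, only booking-db would ride through the loss of either feed; the other two servers are flagged.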
British Airways has offered even less detail so far, which is understandable, as the company’s first priority immediately following the incident had to be its stranded passengers. However, British Airways, too, seemed to blame equipment for a power loss that was followed by a surge. The Guardian report cited an internal British Airways memo that blamed a technician for mishandling an uninterruptible power supply (UPS), bypassing it at first and then bringing it online in an uncontrolled fashion.
In both instances, pointing to the failure of a single piece of equipment or operator error can be misleading. Highly dependent on mission-critical IT operations, airlines generally operate redundant IT infrastructure, and their facilities should have remained operational had they performed as designed.
How can these failures happen?
In short, a design flaw, construction error or change, or poor operations procedures set the stage for catastrophic failure.
Equipment such as Delta’s power control module or a UPS should be deployed in a redundant configuration to allow for maintenance or to support IT operation in the event of a fault. However, IT demand can grow over time so that the initial redundancy is compromised and the remaining equipment is overloaded when one piece fails or is taken offline. Similarly, mission-critical enterprises like British Airways deploy multiple engine generators and other equipment along independent pathways so that data centres can ride through power outages and remain isolated from surges.
Organisations can compromise their redundancy by failing to track load growth, lacking processes that manage data centre changes, or making poor business decisions because of unanticipated or uncontrolled load growth. The result can be single points of failure, a lack of fully redundant and independent systems, or poorly documented maintenance and operations procedures.
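The way load growth erodes redundancy comes down to simple arithmetic. The sketch below assumes a hypothetical set of three 500 kW UPS units sharing one IT load; the function name, unit sizes and load figures are illustrative assumptions, not data from either airline.

```python
def survives_single_failure(unit_capacities_kw, it_load_kw):
    """Return True if the remaining units can carry the load after the largest unit is lost."""
    if len(unit_capacities_kw) < 2:
        # A lone unit has no backup at all.
        return False
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return it_load_kw <= remaining_kw


# Hypothetical installation: three 500 kW units, so 1,000 kW remains after one fails.
ups_units_kw = [500, 500, 500]

# The same equipment checked against a growing IT load.
for load_kw in (800, 1000, 1200):
    print(load_kw, survives_single_failure(ups_units_kw, load_kw))
```

With these assumed figures, the installation tolerates the loss of one unit at 800 kW and 1,000 kW of load but not at 1,200 kW, even though the total installed capacity of 1,500 kW still exceeds the load. No equipment changed; only the load grew.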
Neither airline has yet reported why a single equipment failure could cause such damage, what could have been done to prevent it, or how they will respond in the future.
How can organisations avoid these types of incidents?
Companies spend millions of dollars to build highly reliable data centres and keep them running. They don’t always achieve the desired outcome because of a failure to understand data centre infrastructure and how it works.
The temptation to reduce costs in data centres is great because data centres demand enormous amounts of energy to power and cool servers, as well as experienced and qualified staff to operate and maintain them.
Value engineering in the design process and mistakes and changes in the building process can result in vulnerabilities, even in new data centres. Poor change management processes and incomplete procedures in older data centres are another cause for concern.
Over time, even new data centres can become vulnerable. Change management procedures must help organisations control IT growth, and maintenance procedures must be updated to account for equipment and configuration changes. Third-party verifications can ensure that an organisation’s procedures and processes are complete, accurate, and up to date, mitigating human error and reducing risk.
Maintaining up-to-date procedures is next to impossible without solid management processes that recognise that data centres change over time as demand grows, equipment changes, and business requirements change.
What can your organisation learn from these cautionary tales?
Published reports suggest that Delta thought it was safe from this type of outage. British Airways may have had slightly more warning, as published reports indicate that it had also experienced delays because of problems with its online check-in systems. However, other airlines have also experienced delays due to IT failures, and some may be similarly vulnerable because of skimpy IT budgets, poor prioritisation, merging of systems or flawed processes and procedures.
Neither human error nor equipment failure should ever bring an entire company to its knees. If your business relies on IT for mission-critical services for customer-facing or internal users, then you should consider having a holistic third-party evaluation of your data centre’s infrastructure, management processes and operations procedures.
While top-to-bottom data centre assessments require a substantial commitment of time and money, these costs are minuscule relative to those suffered by organisations that roll the dice with risky cost-cutting decisions and place too little emphasis on the operations best practices and training that can prevent catastrophic outages in the first place.
Kevin Heslin, senior editor, Uptime Institute