Large-scale data centre outages will always dominate the headlines, but in reality, everyday incidents of downtime are commonplace. According to the Uptime Institute’s seventh annual Data Centre Industry Survey, 25 per cent of organisations surveyed experienced a data centre outage in the last 12 months, either on their own premises or at a service provider site. In addition, 90 per cent of data centre and IT professionals say their corporate management is more concerned about outages now than it was just 12 months ago.
Case in point: the Delta Air Lines outage last year. In this instance, a single electrical fault in its Atlanta data centre resulted in a massive system failure and the grounding of some 2,000 of its flights across a three-day period. What is interesting about this incident is that the margin for error was so thin that a single process breaking down led to a £150m meltdown, catastrophic server damage, and tens of thousands of passengers marooned at airports around the world.
Whilst the consequences of an outage are in many cases less severe than in the Delta incident, the financial implications are not always clearly understood by business leaders. However, you can be sure they’ll sit up and take notice when the balance sheet comes in. The Uptime Institute survey also found that only 60 per cent of organisations actually factor in the cost of downtime as a business metric. For a modern business, this should by now be regular practice. Being able to estimate the cost of each minute or hour of downtime can play a key role in ensuring that infrastructure resilience is front of mind for IT professionals.
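To make the idea of a per-minute downtime cost concrete, here is a minimal sketch in Python. The cost categories, the class name DowntimeCostModel and every figure in it are illustrative assumptions, not numbers from the Uptime Institute survey or from any real organisation; a real model would likely include further items such as contractual penalties and reputational damage.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCostModel:
    """Hypothetical cost model; all fields are assumptions for illustration."""
    revenue_per_hour: float     # revenue that depends on the data centre being up
    staff_cost_per_hour: float  # cost of staff who cannot work during an outage
    recovery_cost: float        # fixed cost per incident (call-outs, repairs)

    def cost_per_minute(self) -> float:
        # Time-dependent costs, expressed per minute of downtime.
        return (self.revenue_per_hour + self.staff_cost_per_hour) / 60

    def incident_cost(self, minutes: float) -> float:
        # Time-dependent costs for the outage plus the fixed recovery cost.
        return self.cost_per_minute() * minutes + self.recovery_cost


# Made-up example figures, in pounds.
model = DowntimeCostModel(
    revenue_per_hour=12_000,
    staff_cost_per_hour=3_000,
    recovery_cost=5_000,
)
per_minute = model.cost_per_minute()
ninety_minute_outage = model.incident_cost(90)
```

With these assumed inputs, each minute of downtime costs £250 and a 90-minute outage costs £27,500. Even a rough figure of this kind gives IT and business leaders a shared number to weigh against the cost of resilience measures.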
Of course, understanding the potential risks that can affect the data centre is very different from taking proactive steps to predict resilience issues accurately before something goes wrong. So, once you are aware of the risks, how do you go about safeguarding your infrastructure?
Flexibility, efficiency and resilience
It’s important to understand that maintaining the efficiency and the resilience of infrastructure are the top two priorities for every data centre manager. In terms of operations, this means matching power and cooling to demand while cutting avoidable costs. However, this remains a fluid process, because a data centre must maintain enough flexibility and agility to be able to respond to the needs of the business. This means infrastructure, compute power and performance should be able to scale effectively, often, and with no risk of downtime.
Another often overlooked factor for data centres today is the impact of any changes that are made to the data centre environment, such as the implementation of new technology. Usually, IT doesn’t know how changes such as these will alter the overall efficiency, performance and resilience of the DC environment.
The problem usually lies in a disconnect between the IT department and Facilities Management teams. Many organisations have deployed data centre infrastructure management (DCIM) technology with the goal of bridging the data and process gaps that exist within a business regarding its data centres. This is a positive step, but it is by no means comprehensive. Effective communication between Facilities and the IT department must be established in order to ensure that strategies are aligned, and that one department is not making decisions that have an adverse effect on the other.
Be prepared for anything
The key to effective data centre management is the ability to predict the impact of any change, no matter how small, to the DC environment. This could be something as simple as installing blanking plates inside racks. Alternatively, it may involve a much more drastic change, such as increasing the power of your facility by 300kW.
Fortunately, there are engineering simulation tools available that allow facilities managers to do this with a high degree of accuracy. Using engineering simulation that combines Computational Fluid Dynamics (CFD) and Power Simulation, facilities managers can evaluate the potential consequences of a change in a safe, offline area by accurately mapping out the environment of the data centre, troubleshooting existing designs and analysing various future scenarios that may affect their facility.
With these tools in place, facilities managers can be confident that, no matter the demands of the business, the most efficient possible strategy can be identified and put in place. As a consequence, downtime can also be made a thing of the past, or at least be mitigated to non-harmful levels.
Simulations can also be used to experiment with ‘what if?’ scenarios. For instance, in the event of a power failure, will any critical systems go offline? Should specific hardware not be connected in a particular way, what would the eventual outcome be? Would it be detrimental to the data centre and have an adverse effect on resilience? If so, how can this be mitigated in a way which doesn’t cause damage? The availability of answers to these questions, and many more, can assist data centre managers in formulating strategies where they can accurately visualise their entire DC power chain, allowing them to safeguard critical hardware and keep downtime to a minimum.
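A heavily simplified sketch of the first of those ‘what if?’ questions is shown below in Python. It is not how CFD or commercial power simulation tools work internally; it only illustrates the idea of tracing a power chain to see which critical systems lose supply when a component fails. The topology, the device names and the functions is_powered and offline_after are all hypothetical.

```python
# Hypothetical power chain: each device maps to the upstream devices that can
# feed it. A device with no entry is treated as a source (e.g. utility supply).
# The chain is assumed to contain no loops.
feeds = {
    "ups_a": ["utility"],
    "ups_b": ["utility", "generator"],
    "pdu_a": ["ups_a"],
    "pdu_b": ["ups_b"],
    "db_server": ["pdu_a", "pdu_b"],  # dual-corded: survives loss of one feed
    "booking_app": ["pdu_a"],         # single-corded: one path only
}
critical = ["db_server", "booking_app"]


def is_powered(node, feeds, failed):
    """Return True if `node` still has at least one live path to a source."""
    if node in failed:
        return False
    upstream = feeds.get(node, [])
    if not upstream:
        return True  # a source that has not failed
    return any(is_powered(u, feeds, failed) for u in upstream)


def offline_after(failed, feeds, critical):
    """List the critical systems that go offline if `failed` components fail."""
    failed = set(failed)
    return sorted(n for n in critical if not is_powered(n, feeds, failed))


ups_a_fails = offline_after({"ups_a"}, feeds, critical)
utility_fails = offline_after({"utility"}, feeds, critical)
total_loss = offline_after({"utility", "generator"}, feeds, critical)
```

In this made-up topology, losing either UPS A or the utility supply takes the single-corded booking application offline, while the dual-corded database server stays up via the generator-backed path. Only the loss of both utility and generator brings down both systems. A real simulation would also account for load, capacity and cooling, which this sketch ignores.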
If 90 per cent of data centre and IT professionals say their corporate management is more concerned about outages now than it was just 12 months ago, it is clear that operational resilience must be taken extremely seriously by both IT and Facilities. With the right strategy and the best DCIM and engineering tools in place, this can become a reality.
But what about the remaining 10 per cent of management who are indifferent? In all likelihood, either they are 100 per cent sure that their teams have safeguarded the data centre against any and all eventualities, or they will be in for a rude awakening when the next incident of downtime occurs within their organisation. When that happens, they too will soon understand the massive destructive influence of downtime. To avoid this, Facilities Managers must gain an understanding of how downtime can affect their data centres, and fast!
Jon Leppard, Director, Future Facilities