Data centre complexity and risk: the human factor

This article was originally published on Technology.Info.

Human error is the biggest cause of failure in industry. Around 80% of major accidents – think Chernobyl, the Challenger space shuttle, and Concorde – have been caused by human error in one form or another. The data centre industry is no exception to this rule: the Uptime Institute estimates that 70% of failures there are the result of human error.

What does this mean for data centres? We caught up with Operational Intelligence's David Cameron to find out more. He believes a significant reduction in risk, and improved operational efficiency, can only be achieved by engaging with the human element in data centres: the people who run them.

Theory, Practice, Experience and Reflection

Many data centres document risks through analysis of single points of failure, failure mode and effect analysis (FMEA), and failure mode, effect and criticality analysis (FMECA). However, staff working at the data centre may well be unaware of these documents, which reduces their usefulness.

The experience of the individual and the accumulated experience of the company interact to improve safety and energy efficiency in any organisation. An essential element of this is learning, both organisational and individual. Research by Kolb suggests that individuals learn in a cycle of Theory, Practice, Experience and Reflection. But in real life, those cycles are seldom all completed by the same person when setting up a new technology in a data centre, because the construction industry tends to use a model that works best where the technology is not integral to the building.

The Swiss Cheese Model

Managing and predicting the risk of failure also becomes more difficult in a more complex system. James Reason's Swiss Cheese model makes clear why this is so. He posited that each layer of protection from failure has holes in it. Usually, total system failure doesn't happen because the holes in each layer do not line up. But every now and then, something in every protective layer will fail. If those failures align, the whole system will fail. The more layers, and the more interactions between them and within them, the more difficult it is to predict when and how the holes will line up, and when and how failure will occur – except that it will.
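The arithmetic behind the Swiss Cheese model can be illustrated with a minimal Monte Carlo sketch. The layer count and the 10% "hole" probability here are illustrative assumptions, not figures from the article, and the layers are treated as independent – real protective layers often are not, which is exactly why prediction gets harder:

```python
import random

def system_failure_rate(n_layers, hole_prob, trials=100_000, seed=42):
    """Estimate how often every protective layer fails at once
    (the 'holes line up'), assuming independent layers."""
    rng = random.Random(seed)
    aligned = sum(
        all(rng.random() < hole_prob for _ in range(n_layers))
        for _ in range(trials)
    )
    return aligned / trials

# Each layer independently has a 10% chance of a 'hole' at any moment.
for layers in (1, 2, 3):
    print(f"{layers} layer(s): ~{system_failure_rate(layers, 0.10):.4f}")
```

Each added independent layer cuts the aligned-holes rate by roughly an order of magnitude – but the moment layers interact, that independence assumption breaks down, and so does the easy prediction.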

But all is not lost. The best way to reduce risk of failure is to share information about systems and possible failure points. If we can learn more about the system, then we can reduce the number of ways in which the system can fail that we don’t know about. That way, we can guard against more of the risks, and make failure less likely. And the best way to learn and share information is to adopt a learning culture, rather than a ‘blame’ one. A blame culture will lead to silo working and protection of groups. A learning culture will lead to sharing of vital information to avoid future failure, and also ensure that staff are keen to learn from contractors before they leave.

Keep it simple

But perhaps there is another, more crucial learning point. As well as sharing information, the other way to reduce risk is to reduce complexity. It is always tempting to make things cleverer, smarter, that bit more complicated. But the challenge to avoid the human error element is to keep things simple, so that everyone can understand them and see when failure is likely.
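The cost of complexity is easy to put in numbers. As a hedged illustration (the component count and per-component reliability below are assumed, not taken from the article): if a system only works when all of its components work, overall reliability decays exponentially with the number of components.

```python
def overall_reliability(n_components, component_reliability=0.999):
    """Reliability of a series system: all components must work,
    and failures are assumed independent."""
    return component_reliability ** n_components

# Even 99.9%-reliable parts add up: more parts, less reliable whole.
for n in (10, 100, 1000):
    print(f"{n:>4} components: {overall_reliability(n):.3f}")
```

With 1,000 interdependent parts, even three-nines components leave the system working barely a third of the time – a simple quantitative case for keeping designs as simple as possible.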

To learn more about effective transfer of knowledge between data centre delivery and operations teams, attend Dave Cameron's workshop taster on the DCA virtual track program at Data Centre EXPO, entitled "80% of Catastrophic Failures are Due to the Human Element".