Experiencing a downtime incident is the worst way to find out that your data centre risk management and operations practices are insufficient.
In today’s global, 24-hour economy, an organisation’s IT infrastructure is business critical. Any large, complex system such as a data centre demands continued vigilance to maintain performance.
However well designed a facility may be on paper, its reliability ultimately stands or falls on the day-to-day operations of the site. If the operations team is not focused on continuous quality improvement, the inevitable result is not stasis, but decline. How do you know if day-to-day operations procedures are effective? How can you tell if the operating culture and practices are sufficient to manage risk, or if your facility is in danger of having an unexpected error or failure?
Operations are as important as infrastructure
Designing and building a new business-critical data centre is a complex and expensive prospect. But even the most robust design and infrastructure will not keep a site from having an outage if the individuals running it do not follow procedures. From the day the centre opens its doors, all of the investment in high-availability infrastructure - and the business mission it supports - can be put at risk if effective management and operations procedures are not established and maintained.
It is easy to focus on facility infrastructure and equipment as a safeguard against downtime. However, statistics show that the most significant cause of data centre incidents is not mechanical failure but human operations, or “human error.”
Issues in the data centre are often the cumulative result of poor management decision-making and organisational misalignment. While many errors may appear on the surface to be attributable to one person’s mistake, they are almost invariably the downstream impact of leadership or management policies and decisions, or a reflection of the broader operating environment and culture of the organisation. Even a great facilities team can be stymied by scarcity of resources, unclear mandates, or a lack of management support.
Applying management and operations best practices at all levels of the organisation can minimise the risk of human error. Existing data centres may have infrastructure vulnerabilities due to aging facilities or equipment, but they can still reduce downtime risk and even outperform centres that have better “on paper” topology if the operations team is working effectively. It’s never too late to identify and close any gaps or omissions, refresh processes, and correct bad habits that may have crept in over time.
The consequences of failure can be significant in site impact, business cost, and market perception, so it is worth the effort to shore up operational standards. Applying rigorous best practices will help make the most of aging assets while keeping them functioning at optimum levels, reducing risk, and achieving maximum efficiency.
Five questions you should be asking yourself
To uncover operating risks and make the cultural and management shift to effective practices, organisations must start by asking some tough questions:
- Can you easily replace any member of the team? If not, this indicates that roles and responsibilities are not clearly defined, and processes aren’t well documented.
- Are you protected against poor operations practices migrating from older sites to higher criticality data centres? If not, this indicates a lack of consistently applied standards across the portfolio.
- Do you have sites that operate in isolation, ignoring global corporate standards? This issue often arises in the wake of mergers and acquisitions, or due to a “Lone Ranger” problem at specific sites.
- Do you even have corporate global standards? Everyone in the organisation needs to be clear about the overall mission and objectives.
- If you outsource any aspect of your data centre operations, how do you avoid losing responsibility and accountability? You need vendor teams to act as an extension of your team, with training and adherence to the same policies and procedures.
Assessing the risk
Even the world’s leading data centres can have oversights or operating shortfalls. Studies have shown that certain conditions in a data centre correlate with higher error rates. Some of these conditions may be causal, creating opportunities for errors and outages to occur. Others are merely symptoms of an operating environment or culture in which errors are more likely to happen. But any of them should be a red flag signalling that you need to take a close look at practices and procedures in your data centre.
These conditions are all indicators of personnel stretched too thin and of daily operating practices that prevent teams from maintaining regular and proactive processes - putting your organisation at high risk of unplanned downtime.
Lee Kirby, Chief Technology Officer, Uptime Institute