IT Infrastructure failing as if the past two decades never happened - Part 2


In Part 1 of this series, we examined recent data centre outages and the reasons why these “cautionary tales” came to pass. Now, let’s discuss practical tips for minimising the risk of outages in business-critical infrastructure.

Getting past misconceptions 

Human error and equipment failure are often cited as the root causes of engineering system outages, but most of the time these elements don’t actually cause major disasters on their own. Rather, they are symptoms of a larger issue – poor management and operations practices. Leadership decisions and priorities that result in inadequate staffing and training, an organisational culture dominated by “fire drills,” or budget cuts that reduce necessary maintenance can produce pervasive failures that flow from the top down.

Although front-line operator error may sometimes appear to cause an incident, a single mistake (just like a single data centre component failure) isn’t typically enough to bring a robust complex system to its knees – unless the system is already teetering on the edge of critical failure as a result of numerous underlying risk factors.

It’s true that vulnerabilities are present within even the best-designed data centres. Companies with complex IT systems combat the risk of failure with multiple layers of protection and backup. So again, when IT failures take place, it’s not due to a lack of backup systems or any one issue in particular; it’s an indication of poor management. Catastrophic data centre incidents like the ones we saw in 2017 are avoidable if organisations design their infrastructure to industry standards, with redundancy and other preventative measures baked in, and implement stringent management and operations best practices.

Every business should conduct thorough failure analyses and apply the lessons learned when developing and refining its program, so that business-critical facilities become resilient and successful over the long term. An organisation’s responsiveness, its familiarity with documented procedures, and its adherence to them are key measures of performance.

Practical considerations for minimising risk 

Throughout the past 20 years, Uptime Institute has delivered operations assessments across hundreds of data centre facilities and has identified key management shortfalls that increase risk. Many data centre programs – even rigorous operations that have been successful – are subject to various risks and can be improved through continuous assessment and development. Take a moment to review your program with an objective eye; if you can answer yes to any of the following questions, you may be experiencing a crisis in management rigor:

· Are data centre staff voicemail boxes full, emails left unanswered, inbox size limits exceeded?
· Are critical meetings missed or routinely cancelled?
· Does your data centre team report a lack of time for training?
· Are there any whisperings about a potential shortage of qualified staff?
· Are certain team members performing work outside their competency?
· Does your team experience high personnel turnover?
· Has maintenance exceeded its budget? How about energy cost estimates?
· Do the backs of your servers or cable trays look like a spaghetti pot blew up?
· Do your equipment and cabling lack a clear labelling system?
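For teams that want to track this self-assessment over time, the checklist above can be reduced to a simple yes/no survey. The sketch below is purely illustrative (the question list and function names are ours, not Uptime Institute’s); it encodes the article’s point that even a single “yes” signals a problem with management rigor.

```python
# Hypothetical self-assessment helper; the questions paraphrase the
# checklist above and the names are illustrative, not an official tool.
MANAGEMENT_RIGOR_CHECKS = [
    "Voicemail boxes full or emails left unanswered",
    "Critical meetings missed or routinely cancelled",
    "Team reports a lack of time for training",
    "Whisperings about a shortage of qualified staff",
    "Team members working outside their competency",
    "High personnel turnover",
    "Maintenance or energy costs over budget",
    "Server backs or cable trays look like a spaghetti pot blew up",
    "Equipment and cabling lack a clear labelling system",
]

def rigor_report(answers):
    """answers: one boolean per check, True meaning 'yes, this is happening'."""
    flagged = [q for q, yes in zip(MANAGEMENT_RIGOR_CHECKS, answers) if yes]
    # Per the article, answering yes to ANY question indicates a crisis
    # in management rigor, so a single flag already marks the site at risk.
    return {"flagged": flagged, "at_risk": len(flagged) > 0}

report = rigor_report([False] * 8 + [True])  # only the labelling check fails
```

Running the same survey quarterly and watching the flagged list shrink (or grow) gives management an objective trend line rather than a one-off gut check.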

It can be relatively easy to determine other underlying risk factors that are being left untended by management. Walk through your facility and ask yourself these questions to ensure the appropriate processes and documentation are in place: 

· Are there any combustible materials on the raised floor, in the battery room, or in electrical rooms? All incoming equipment should be stripped of packaging outside the critical space.
· Are unrelated items (office furniture, shelving units, tools) stored in critical space? This is a fire, safety, and contamination issue.
· Do any fire extinguishers on the premises have out-of-date tags?
· When was the last time you reviewed housekeeping policies and procedural documentation?
· If the facility operates a raised floor, what is the condition of the underfloor plenum? This area should be cleaned regularly; ask to see the schedule.
· How many employees have access to the critical space? Does your organisation even have an access policy for staff?
· Are non-vetted individuals being allowed into critical areas? Ask to see the vendor check-in and training requirements; non-vetted individuals should never be allowed.
· Are panels, switchboards, and valves labelled to indicate their “normal” operating positions?
· Is arc flash labelling installed on all panels and PDUs?

For over a decade, data centre cooling practices have called for air flow isolation—cool air delivered to the front of a rack of IT equipment and hot air exhausted out the back. In a raised floor environment, rows of equipment are typically arranged in a Hot Aisle – Cold Aisle configuration, in which perforated tiles deliver cool air to the cold aisle or server intakes. When reviewing your organisation’s cooling procedures, consider the following indicators of poor bypass air flow management. These factors can result in heightened risk, cooling inefficiencies, wasted money and poor adherence to key management best practices:

· There are grated or perforated panels in the Hot Aisle.
· There are unsealed cutouts in the raised floor.
· There are uncovered gaps in the racks between IT hardware.

Here are several other key steps that can help identify elements of your data centre that constitute poor management procedures and increased risk of downtime:

· Ask to see records and schedules for maintenance activities on batteries, engine generators, and mechanical systems.
· Review staffing documentation; overtime rates greater than 10 per cent can lead to an increase in human error, which in turn increases the likelihood of an outage. Are roles and responsibilities documented? Are qualifications listed?
· Ask to see the list of preventive maintenance activities. Are the activities fully scripted? What is the quality control process?
· Find out who keeps critical documentation on equipment, including warranty information, maintenance records, and performance data.
· Revisit your process for maintaining the reference library (staffing, equipment, maintenance, procedures, and scripts).
· Analyse your team’s training records, annual budget, and time allocation.
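The 10 per cent overtime threshold mentioned above is one of the few checks in this list that can be computed directly from staffing records. A minimal sketch, with illustrative function names of our own choosing:

```python
def overtime_rate(overtime_hours, regular_hours):
    """Overtime worked as a fraction of regular scheduled hours."""
    if regular_hours <= 0:
        raise ValueError("regular_hours must be positive")
    return overtime_hours / regular_hours

def staffing_risk(overtime_hours, regular_hours, threshold=0.10):
    # Per the guidance above, sustained overtime beyond roughly 10% of
    # scheduled hours correlates with increased human error and thus
    # a higher likelihood of an outage.
    return overtime_rate(overtime_hours, regular_hours) > threshold

# A team scheduled for 1600 regular hours that logged 200 overtime hours
# is running at 12.5% overtime, above the threshold.
flagged = staffing_risk(200, 1600)
```

Feeding a quarter’s timesheet totals through a check like this turns a vague sense of “the team seems stretched” into a concrete number that can be raised with leadership.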

Organisations are continuing to adopt new IT models to deal with the ever-growing reliance on technology and data in modern business. As such, availability has never been more important. While it’s virtually impossible for an organisation’s processes, procedures, and site culture to be perfect, successful IT infrastructure teams remain hyper-focused on preventing failure. This means staying vigilant at all times and constantly addressing (and readdressing) the considerations listed above to pinpoint hidden vulnerabilities in your IT operations, which can serve as the basis for productive conversations about change and improvement. The fact that your facility hasn’t experienced an incident yet doesn’t mean it’s immune. A solid commitment to management and operations excellence can have a tremendous impact on the performance of your IT infrastructure, so ask the hard questions and cover all your bases to eliminate preventable outages.

Lee Kirby, president, Uptime Institute
Matt Stansberry, senior director of content & publications, Uptime Institute