99.999% availability: Is 5 minutes of unplanned downtime a year really possible?

Real-time access to corporate data underpins the success of many companies today. If a system failure occurs, business can grind to a halt and employees are forced to pause their work. In fact, one study by Meta Group found that medium and large companies reported revenue losses of between £21,000 and £30,000 for every minute of downtime.

Unsurprisingly then, organisations tend to go to great lengths to maximise the availability of online data while simultaneously taking steps to minimise the risk of data loss. Achieving 99.999 per cent availability, the equivalent of less than 5.26 minutes of unplanned downtime per year, is regarded as the Holy Grail for such companies.

What is availability?

Availability is measured in uptime. When the IT industry refers to “five nines” availability, this translates into little more than five minutes of downtime a year. Four nines, by contrast, equates to approximately 53 minutes of downtime over the same period.

These are statistical averages of course, but those additional 48 minutes of uptime readily justify the relatively modest investment required to achieve high availability (HA) and greater redundancy in the SAN (storage area network) infrastructure supporting mission-critical applications.
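The arithmetic behind the “nines” is simple enough to check for yourself. The sketch below (assuming a Julian year of 365.25 days; a 365-day year gives almost identical figures) converts an availability level into its annual unplanned-downtime budget:

```python
# Downtime per year implied by an availability level expressed in "nines".
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes (Julian year)

def downtime_minutes_per_year(nines: int) -> float:
    """Annual unplanned-downtime budget in minutes, e.g. nines=5 -> 99.999%."""
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR

print(round(downtime_minutes_per_year(5), 2))  # five nines: ~5.26 minutes
print(round(downtime_minutes_per_year(4), 1))  # four nines: ~52.6 minutes
```

The gap between four and five nines, roughly three quarters of an hour a year, is what the investment in redundancy buys.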

So, how can you minimise your chances of outages to achieve 99.999 per cent data availability? And what should you look for when considering a HA solution for your IT infrastructure?

How to design a high availability system

Redundant Arrays of Independent Disks (RAID), supplemented by data replication, help to maximise uptime.

High availability is achieved through a combination of three design elements:

  • High reliability (measured by the Mean Time Between Failures or MTBF) of the storage system and its several subsystems;

  • Redundant subsystems to eliminate as many single points of failure as possible; and

  • Rapid repair of any failure (measured by Mean Time to Repair or MTTR) by using Field Replaceable Units (FRUs) for all critical subsystems.

The following equation for availability demonstrates the vital role of serviceability in the system’s design. Maximum availability can be achieved only by minimising the time it takes to effect a repair, which is reduced significantly by using FRUs.

Availability = MTBF / (MTBF + MTTR)
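To see why MTTR matters so much, it helps to plug some numbers into the equation. The figures below are hypothetical, chosen only to illustrate the calculation:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), both expressed in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A hypothetical 100,000-hour MTBF with a one-hour FRU swap
# already delivers five nines of availability.
print(availability(100_000, 1))  # ~0.99999
```

Note that halving the repair time improves availability just as effectively as doubling the time between failures, which is exactly the case the article makes for FRUs.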

Design for reliability & serviceability (DFRS)

Designing hardware for high reliability and serviceability involves both the system and its several subsystems. To achieve HA at the system level, storage vendors integrate reliability into the design process in several ways.

The first and most obvious is the use of storage device (disk drive) redundancy with RAID configurations (RAID 1, 3, 5, 6, 10 and 50) and dual power supplies, each of which includes its own fan to prevent overheating (and thereby accelerated component failures).

Even higher availability is achieved by adding redundant controllers to the above set up. By eliminating single points of failure in these critical subsystems, the system itself continues to operate normally during a failure of any single FRU. While such a failure does factor into the subsystem’s MTBF, it does not diminish the availability of the system itself.
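The benefit of a redundant pair can be quantified with a standard reliability-engineering result (not from the article itself): assuming the two units fail independently, the pair is unavailable only when both are down at once.

```python
def redundant_pair_availability(single: float) -> float:
    """Availability of a redundant pair, assuming independent failures:
    the pair is down only when both units are down simultaneously."""
    return 1 - (1 - single) ** 2

# A controller that is individually 99.9% available (three nines)
# yields roughly 99.9999% (six nines) when deployed as a redundant pair.
print(redundant_pair_availability(0.999))
```

The independence assumption is the catch: shared components such as the midplane are exactly why the article stresses keeping those parts simple and highly reliable.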

What features does a good SAN have?

A good SAN architecture features full redundancy for every subsystem requiring a significant number of active components. The mechanical chassis itself cannot be redundant, of course, and there is a single midplane that performs the simple function of connecting the redundant controllers to the redundant disk drives.

The midplane has minimal active components, however, and the storage array manufacturer selects these for the highest possible reliability. The result is an extraordinarily high MTBF for the chassis and its midplane, and therefore virtually no impact on system availability, and in turn on data availability.

To enhance system serviceability for the shortest possible MTTR, there are two complementary designs. The first relies on a modular chassis with FRUs: the ability to swap out a confirmed failed subsystem quickly and easily minimises the time it takes to repair an installed system and restore it to full operation.

By utilising such a modular design, which provides convenient access to all subsystems, modern data storage products can be maintained seamlessly with minimal or no disruption in service during most repairs.


A modern and sophisticated mechanical design enables the power supply, fan, I/O module, controller and disk drives to be serviced quickly as hot-swappable FRUs. Being able to replace redundant FRUs while the system is fully operational further enhances availability.

The second serviceability technique is immediate notification of any failure. The longer it takes to detect a failure, the longer it will take before it is repaired, of course. Time is of the essence for another reason, however: the failure of a redundant subsystem creates, in effect, a temporary single point of failure that increases the risk of a system-level outage. For this reason, the firmware in all good systems is designed to detect, isolate and confirm any failure, initiate a failover to a redundant subsystem, and provide immediate notification. The actual “messaging” of the notification can also be configured to match operational procedures, ensuring that on-duty staff are alerted properly and quickly.
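The detect, isolate, fail over and notify sequence described above can be sketched as follows. The `Subsystem` class and `fail_over` helper are purely illustrative, not any vendor’s actual firmware API:

```python
from dataclasses import dataclass

@dataclass
class Subsystem:
    name: str
    active: bool = False

def fail_over(failed: Subsystem, standby: Subsystem, alerts: list) -> None:
    """Isolate the confirmed failure, promote the redundant unit, alert staff."""
    failed.active = False       # isolate: take the failed FRU out of service
    standby.active = True       # fail over: redundant subsystem takes over
    alerts.append(
        f"{failed.name} failed; service continues on {standby.name}. "
        "Replace the FRU promptly."
    )

alerts: list = []
fail_over(Subsystem("controller-A", active=True), Subsystem("controller-B"), alerts)
```

The key point the sketch captures is that service continues uninterrupted while the alert makes the window of temporary non-redundancy as short as possible.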

HA at the subsystem level

At the FRU or subsystem level, data storage arrays utilise three separate design techniques to maximise the MTBF of each, while at the same time also maximising the inclusion of leading-edge SAN features.

The first technique is reducing the part count. Because any individual part can fail, the fewer there are, the higher the inherent reliability of the subsystem.

The second technique is to use only high-quality parts; these cost more, but their superior performance and longer service lives normally contribute to a lower total cost of ownership in the long run. Despite the higher per-part cost, minimising the part count while concurrently enhancing functionality helps to improve the overall price/performance of a highly reliable design.

The third technique involves the de-rating of selected parts. Operating any part or component at or near its rated capacities inevitably shortens its useful service life. For critical parts, the arrays select only those that will be able to operate at approximately 50 per cent of their maximum allowable specifications for voltage, power and/or current. This can substantially increase the service life, and therefore, the MTBF of the subsystem.
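A derating policy like the 50 per cent rule above is easy to express as a simple check. The helper and specification names below are hypothetical, intended only to illustrate the idea:

```python
def within_derated_limits(operating: dict, rated: dict, factor: float = 0.5) -> bool:
    """True if every operating value is at or below factor * its rated maximum."""
    return all(operating[spec] <= rated[spec] * factor for spec in rated)

# Hypothetical part ratings and operating point (units in the key names).
rated = {"voltage_v": 12.0, "current_a": 1.0, "power_w": 12.0}
point = {"voltage_v": 5.0, "current_a": 0.4, "power_w": 2.0}
print(within_derated_limits(point, rated))  # True: well inside 50% derating
```

Running a part at half its rated stress trades a little headline capacity for a substantially longer service life, which is where the subsystem MTBF gains come from.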

Find the right supplier

99.999 per cent availability is a hallmark of a high-performance SAN. While numerous vendors will claim to achieve it, there are several other lines of investigation worth pursuing when considering a new supplier.

If they are claiming five nines, how big is the client base for which they have achieved it? If their data is based on a small handful of organisations, you cannot deduce a great deal from their results, and you therefore have no real indication of a guaranteed level of service.

Also, who are these clients? A clear sign of a good and reliable vendor is a set of users that spans a range of sectors, with varying requirements and budgets.

In today’s market, performance, capacity and availability are central to a successful IT infrastructure, and hence to the business it supports. As long as data remains business critical, downtime remains costly and can have potentially disastrous effects. With a relatively modest investment, 99.999 per cent availability can be achieved to ensure your IT, and your business, is watertight.

By Warren Reid, Marketing Director EMEA at Dot Hill