Storage reliability: It ain't what you do, it's the way that you do it

This article was originally published on Technology.Info.

When looking into storage reliability, it’s all too easy to get caught up in the “hard disk drives are unreliable” melodrama being created by most of the all-flash array vendors. To argue that one medium is more reliable than another is analogous to arguing that cars are more reliable than trucks – they’re two different tools for two different jobs.

The good news is that when it comes to traditional hard drives, there’s actually more statistical evidence out there to help organisations architect solutions that balance cost and risk. One more piece of evidence that has recently emerged is a study by Backblaze, a cloud backup provider that has been looking at its deployment of around 25,000 disk drives and plotting out their lifecycles. We’ve seen studies in the past, such as one from Carnegie Mellon University, but Backblaze looked at both consumer and enterprise grade drives.

What it initially saw was of no great surprise to those who work in the enterprise storage space. Its first study showed that its SATA consumer grade drives had an annual failure rate (AFR) of 5.1 per cent in the first 18 months and that drive failures followed a “bathtub” curve, with a dramatic increase in years four and five of the drives’ lifecycle. Why do you think the vast majority of storage vendors are happy to give a three-year warranty but get a little jumpy when you ask for an inclusive five-year one? Well, here’s the answer, right here in this graph: the drives are much more likely to fail once they get into later life.

Now your first thought may well be “Ah, but these are consumer grade drives – things will be much better with enterprise grade!” Well, this is where things get interesting. Backblaze went on to look at its enterprise grade drives and plotted drive years of service against failures, showing an AFR of 4.6 per cent for enterprise drives against a figure of 4.2 per cent for consumer drives. Now it’s not a direct comparison, as the enterprise drives are small in number and have only been installed for two years. There’s also an interesting point raised by Seagate on its blog, saying that Backblaze created the “perfect storm” with its use case and physical mounting. This proves a point that a select few in the storage industry have been making for a while: “It ain’t what you do, it’s the way that you do it.”
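For anyone who wants to run the same sort of sum on their own estate, here is a minimal sketch of how an AFR can be worked out from failures and drive years of service. The function names and the small fleet in the example are hypothetical, chosen for illustration only; they are not Backblaze’s data or tooling, and the 365-day year is a simplifying assumption.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return the AFR as a percentage: failures per drive year of service.

    drive_days is the total number of days in service summed across every
    drive in the group (a drive that ran for 90 days contributes 90).
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0  # assumption: a 365-day year
    return 100.0 * failures / drive_years


def expected_annual_failures(drive_count: int, afr_percent: float) -> float:
    """Estimate how many drives in a fleet would fail in one year at a given AFR."""
    return drive_count * afr_percent / 100.0


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 drives, each in service for a full year,
    # with 46 failures over that period.
    afr = annualized_failure_rate(failures=46, drive_days=1000 * 365)
    print(f"AFR: {afr:.1f} per cent")

    # What an AFR of 5.1 per cent would mean across 25,000 drives.
    swaps = expected_annual_failures(25_000, 5.1)
    print(f"Expected failures per year: {swaps:.0f}")
```

Counting drive years rather than simply counting drives is what stops a small, recently installed group of drives from looking better or worse than it really is, which is also why the enterprise figure above needs treating with some caution.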

Anyone can build a storage array. Pop down to your local PC supplies company, grab some drives, grab a server, get an OEM drive shelf enclosure, pop them in, load up some open source software and hey presto – you’ve got an “enterprise grade storage array”. Well, that’s what some manufacturers would have you believe anyway. The truth is that hard disk drives are sensitive little creatures. Take a look at an excellent video made by Sun Microsystems (remember them?) a few years back. The video was produced to show off its funky new software that could analyse drive latency, but it proved the point that drives are sensitive to vibration – in this case an Australian engineer shouting at them. Vibration and noise aren’t the only drive killers – heat and density are big factors too. Add in the error correcting capabilities of consumer grade drives and you start to see some of the AFRs that Backblaze saw.

So how come some vendors such as X-IO (and to a degree, Backblaze with its home-cooked enclosures) have been able to solve this core issue? Well, the key here is good old-fashioned hardware engineering: acknowledge that drives are sensitive to such elements and deal with them. Stop them vibrating, keep them cool and treat them with a little bit of respect by mounting them evenly and horizontally. If I held you vertically, jiggled you about and kept you at high temperatures for a few years, you’d probably feel a little poorly too. X-IO has then gone a stage further and uses patented software to rebuild the drive in the case of errors, but the fundamental hardware design really does make a difference.

Yes, some vendors run predictive failure software and will argue that it’s no big deal to send out an engineer or just a replacement drive, but how many people can reel off an anecdote about a flat-footed, clumsy engineer swapping the wrong drive, knocking a cable out or hitting the EPO instead of the exit button in your data centre?

The fact is that with the arrival of new approaches such as software-defined storage, the temptation to use lower cost components such as cheap commodity desktop drives is going to be rife. When we start to strip away some of the increasingly unnecessary core controller feature sets of many enterprise storage arrays, we’ll be left with the same OEM disk shelves that everyone has. If the vendor hasn’t solved some of the crucial hardware design challenges such as vibration and cooling, then you’ll have an array that can be unpredictable not only in terms of reliability but also in terms of performance.

Yes, you could take the alternative approach of using an all-flash array, but then do you really need a truck to pick up a loaf of bread from the supermarket?