Getting your Disaster Recovery process right – three areas to concentrate on

In 1912, more than 100 years ago, the world witnessed a tragedy of epic proportions when the Titanic sank to the depths of the ocean. While the ship was lauded as being “unsinkable,” there were many areas where poor planning and a lack of organisation made the disaster worse than it needed to be.

Much like crashing into an iceberg in the middle of the ocean, the circumstances leading to data loss and disaster might seem unlikely for IT teams.

This hubris can cause those steering the ship of today’s business to assume they will come through unscathed. However, that simply is not the case. In our 2015 State of Resilience Report, nearly half of the IT professionals surveyed had experienced a failure that required the use of their High Availability (HA) or Disaster Recovery (DR) solutions to resume operations.

What’s more, nearly 50 per cent of companies that experienced a failure on the storage side lost data in the process, due to insufficient disaster recovery methods or practices.

When it comes to business-critical data, data loss like this simply cannot be tolerated. So why aren’t more companies investing in solutions to prevent it? There are three main areas where planning is crucial: storage, security, and planned downtime. Improvements in each of these areas increase the chances of DR plans working well when required; neglecting them can lead to serious issues.

Taken together, planning ahead around these three themes can dramatically improve the chances of a successful recovery, and a faster one as well.

  1. Storage Failure – the main culprit for data loss

Storage failure is a leading factor for data loss in a disaster recovery scenario. Knowing this, companies need to better understand the different reasons why hardware and software malfunctions occur in the first place. This can then be used to plan ahead so issues don’t arise.

Many companies select tape as their primary method for disaster recovery, making the assumption that tape is ‘good enough’ to suit their needs. After all, it has existed for years, and therefore can be trusted, right? In certain instances tape can be sufficient, especially if it is included in a mix of other disaster recovery options. However, companies should be aware of the shortcomings of tape and the risks associated with tape backup as well.

While tape backup can be implemented well, it can also be done poorly. Several factors can render the information stored on a tape useless. Some of these factors can be prevented through good planning, while others are wider issues that the IT team as a whole has to consider.

One of the biggest causes of problems for recovery from tape is the fact that it is a physical medium, and this can lead to issues over time. For example, tape technology continues to evolve just like every other IT sector, but many companies invest in tape for the long term. As companies update their approach to DR and continuity, what happens to the old tapes? Incompatibilities can arise between the drives that originally wrote the tapes and the playback equipment available later if this is not carefully managed; while it is possible to hold on to old tape hardware, it can take up lots of valuable office real estate.

In rare cases, the tape has simply aged beyond the point where it is compatible with any current playback equipment. Another physical problem can occur when data has been corrupted on the storage system to the point where it is no longer readable. This means that it is important to test tapes regularly to check that they are still viable.
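Such testing can be as simple as restoring a sample of files and comparing them against checksums recorded when the backup was created. The sketch below assumes a manifest of SHA-256 hashes was written at backup time; it illustrates the verification idea in general terms rather than describing any tape-specific tool.

```python
import hashlib
import json
from pathlib import Path

# Illustrative sketch: verify restored backup files against a manifest of
# SHA-256 checksums recorded when the backup was taken. The manifest path
# and format are assumptions for illustration only.
MANIFEST = Path("/backups/manifest.json")   # {"relative/path": "sha256 hex", ...}
RESTORE_DIR = Path("/restore-test")         # where the test restore was written

def sha256_of(path: Path) -> str:
    """Hash a file in 1 MB blocks so large restores do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_restore() -> list[str]:
    """Return the files that are missing or no longer match the manifest."""
    expected = json.loads(MANIFEST.read_text())
    failures = []
    for relative, recorded_hash in expected.items():
        restored = RESTORE_DIR / relative
        if not restored.exists() or sha256_of(restored) != recorded_hash:
            failures.append(relative)
    return failures

if __name__ == "__main__":
    bad = verify_restore()
    print("Backup verified" if not bad else f"{len(bad)} files failed verification")
```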

The main advantages of tape are the low cost per unit of storage and the ability to hold data offline without consuming power. Recently, IBM and Fujifilm announced a tape-based storage technology with a recording density of 123 billion bits per square inch. Running on low-cost, particulate magnetic tape, the sheer density of data that can be stored this way makes tape appealing. While tape still makes its case as a relevant storage medium to meet the growing demands for large amounts of backup and archival data, it’s important to put tape into context. As part of this, assess the optimal mix of storage environments for what is required alongside other forms of backup and recovery.
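As a rough sense-check of what that density implies, the back-of-the-envelope calculation below estimates raw cartridge capacity. The tape length and width used here are assumptions based on typical enterprise cartridges, not figures from the announcement.

```python
# Rough estimate of raw cartridge capacity from areal density.
# Assumed figures (not from the announcement): roughly 1,000 m of
# 12.65 mm-wide tape in a typical enterprise cartridge.
density_bits_per_sq_inch = 123e9          # 123 billion bits per square inch
tape_length_inches = 1_000 * 39.37        # ~1,000 m of tape
tape_width_inches = 12.65 / 25.4          # ~12.65 mm tape width

area_sq_inches = tape_length_inches * tape_width_inches
raw_capacity_bytes = density_bits_per_sq_inch * area_sq_inches / 8

print(f"Recordable area: {area_sq_inches:,.0f} sq in")
print(f"Raw capacity:    {raw_capacity_bytes / 1e12:,.0f} TB (before formatting overhead)")
```

On those assumptions the figure comes out at roughly 300 TB of raw capacity per cartridge, which is why the density number attracts so much attention for archival workloads.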

  2. Security and Malware – planning to stop attacks affecting DR

Security attacks and malware can contribute to storage failure too. For example, in 2012, a particularly destructive piece of malware attacked energy organisations in the Middle East, destroying hard disks on infected systems. The malware corrupted files, overwrote each infected machine’s master boot record and destroyed data beyond the point of recovery.

This attack is not an isolated incident. Other malware families have encrypted files and then demanded payment for them to be released, while others have broken into sensitive information stores as part of nation-state campaigns. From a DR perspective, there are three linked elements to consider here: how to stop malware getting included in backup systems in the first place; how to go about getting back to a “known good” state if a malware attack is successful; and how to use DR as part of any clean-up operation after an attack.
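On the “known good” question in particular, one practical safeguard is to retain several backup generations rather than overwriting a single copy, so that a restore point predating the infection still exists. The sketch below is purely illustrative: the directory layout, schedule and retention figure are assumptions, not a description of any particular product.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Illustrative only: keep several timestamped backup generations so a
# pre-infection "known good" restore point survives a malware incident.
SOURCE = Path("/var/lib/app-data")       # hypothetical data directory
BACKUP_ROOT = Path("/backups/app-data")  # hypothetical backup destination
GENERATIONS_TO_KEEP = 14                 # assumption: two weeks of daily copies

def take_generation() -> Path:
    """Copy the source into a new timestamped generation directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = BACKUP_ROOT / stamp
    shutil.copytree(SOURCE, target)
    return target

def prune_old_generations() -> None:
    """Delete the oldest generations beyond the retention limit."""
    generations = sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir())
    for old in generations[:-GENERATIONS_TO_KEEP]:
        shutil.rmtree(old)

if __name__ == "__main__":
    take_generation()
    prune_old_generations()
```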

When malware is unleashed, as in the Middle East example above, the recovery process can be made more difficult if there is a case of vendor lock-in. Here, data can only be recovered where the disk subsystem manufacturer conducts all disk recovery on their own product. This is an impediment to getting data back quickly. If and when a company needs to move data off the infected disk, this hardware protection element prevents them from doing so. It’s therefore important to be mindful of these restrictions and select products that are storage solution-agnostic where at all possible.

Alongside this, it is also worth understanding how the effects of malware attacks and corrupted files can propagate into backups over time. Many DR solutions use snapshot approaches to protect data, where a copy of the server is taken every half an hour or hour. In theory, if an outage occurs, the snapshot preserves all data except for the changes made in that specific 30 to 60 minute window.

However, a prolonged failure can render that data inaccurate and useless, particularly if the flaw is not discovered for a while. It may be better than no backup system whatsoever, but snapshot-based HA/DR can be fundamentally flawed on its own. Real-time replication offers a much more effective and reliable approach: as soon as data changes on one computer, the change is copied and sent to its backup destination. This synchronisation provides the most accurate, up-to-date record of business operations over time. Replication can also be combined with snapshot technology, so data is captured in real time while snapshots are taken and stored as “known good” states.
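To make the distinction concrete, the sketch below copies each file to a replica as soon as a change is detected, while still taking periodic full snapshots as “known good” states. It is a simplified, assumption-laden illustration (a polling loop, fixed paths and intervals), not how any specific replication product works.

```python
import shutil
import time
from pathlib import Path

# Illustrative sketch only: poll a source directory and copy changed files
# to a replica as soon as a change is detected, while also keeping periodic
# full snapshots as "known good" states. Paths and intervals are assumptions.
SOURCE = Path("/data/live")
REPLICA = Path("/data/replica")
SNAPSHOTS = Path("/data/snapshots")
POLL_SECONDS = 1
SNAPSHOT_SECONDS = 1800  # one snapshot every 30 minutes

def replicate_changes(last_seen: dict[Path, float]) -> None:
    """Copy any file whose modification time has advanced since the last poll."""
    for path in SOURCE.rglob("*"):
        if path.is_file():
            mtime = path.stat().st_mtime
            if last_seen.get(path) != mtime:
                target = REPLICA / path.relative_to(SOURCE)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target)
                last_seen[path] = mtime

def take_snapshot() -> None:
    """Store a full point-in-time copy that can serve as a known-good state."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.copytree(SOURCE, SNAPSHOTS / stamp)

if __name__ == "__main__":
    seen: dict[Path, float] = {}
    next_snapshot = time.time()
    while True:
        replicate_changes(seen)
        if time.time() >= next_snapshot:
            take_snapshot()
            next_snapshot = time.time() + SNAPSHOT_SECONDS
        time.sleep(POLL_SECONDS)
```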

  3. Achieving Planned Downtime Perfection

While natural disasters and other circumstantial crises can justify incorporating HA/DR as an insurance policy against data loss, the reality is that these scenarios are only the tip of the iceberg. According to our work with customers, 90 per cent of downtime is planned in advance. Companies undergo regular, planned downtime periods for operating system or database software upgrades, hardware upgrades, system location migrations and maintenance upgrades. These migration projects are necessary to keep IT running smoothly, but each represents a time when information is exposed to an increased risk of loss.

Planned downtime should ideally proceed smoothly. However, if it is not executed thoughtfully – and with the right kind of processes in place – even planned downtime can lead to data loss disaster. Planning ahead for downtime should provide an opportunity for companies to test their business continuity strategy. This ensures that employees possess the appropriate skills and are confident enough to carry out their roles, and that any technology assets are functioning correctly.

Previously, testing would involve a full-scale switchover between the primary systems and the secondary site. While it should all go smoothly, it can represent a huge potential cost to the business in terms of lost revenue, which often meant that tests were infrequent. Today, Cloud computing can help companies test their HA/DR implementations while keeping their production systems in place. By spinning up servers in the Cloud, testing can be carried out in a way that does not impact day-to-day activities, while still reassuring IT and business leaders that their recovery plans are effective.
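As a simple illustration of that approach, the sketch below uses the AWS SDK for Python (boto3) to spin up a temporary instance for a failover rehearsal and tear it down afterwards. The image ID, instance type and region are placeholders rather than recommendations, and any cloud platform could be used in the same way.

```python
import boto3

# Illustrative sketch: spin up a temporary cloud server to rehearse a DR
# failover without touching production. The AMI ID, instance type and region
# below are placeholder assumptions.
ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical image of the recovery server
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "dr-failover-test"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]

# ... restore data and run application checks against the test instance here ...

# Tear the test environment down afterwards so it cannot affect production.
ec2.terminate_instances(InstanceIds=[instance_id])
```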

Alongside this, planned downtime windows can be reduced by making best use of DR tools in everyday migrations. Replication tools can be used to store data created by applications while a migration is being carried out, then applied to the new systems as part of a short cut-over.
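A minimal sketch of that cut-over idea follows: changes made while the migration runs are queued, then applied to the new system in one short window. The change record, queue and apply callback are illustrative assumptions, not a description of a specific replication tool.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative sketch: queue changes made during a migration, then apply
# them to the new system during a short cut-over window. The Change record
# and apply callback are assumptions for illustration, not a real product API.
@dataclass
class Change:
    key: str
    value: str

@dataclass
class MigrationReplicator:
    apply_to_new_system: Callable[[Change], None]
    pending: list[Change] = field(default_factory=list)
    cut_over_started: bool = False

    def record_change(self, change: Change) -> None:
        """During the migration, keep accepting writes and queue them for later."""
        if self.cut_over_started:
            self.apply_to_new_system(change)   # after cut-over, write straight through
        else:
            self.pending.append(change)

    def cut_over(self) -> None:
        """Short cut-over: drain the queued changes onto the new system."""
        self.cut_over_started = True
        while self.pending:
            self.apply_to_new_system(self.pending.pop(0))

# Example usage with a stand-in "new system" store.
new_system: dict[str, str] = {}
replicator = MigrationReplicator(lambda c: new_system.update({c.key: c.value}))
replicator.record_change(Change("order-1001", "shipped"))  # happens mid-migration
replicator.cut_over()                                      # applied during cut-over
```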

This means that the business can keep running operations as normal rather than needing to stop work while the migration is carried out. This represents a great opportunity for IT to add value back to the business, rather than affecting everyday operations.

Just like the Titanic, IT operations can often be categorised as large, unwieldy and difficult to change when they are in motion. However, approaching DR in the right way means that IT teams can take some pain points out of their everyday plans while also improving the results that are delivered back to the business.

Ian Masters, Vice President Cloud and Strategic Alliances, Vision Solutions
