Although IT disasters are unpredictable, data recovery shouldn’t be. In fact, recovery should be planned, predictable and controlled. The following steps will help you organise your thoughts, ask the right questions and develop a strategy for your DR plan that is closely aligned with your business.
1. Conduct an asset inventory
Disaster recovery planning should always start with an inventory of all your IT assets. This step is necessary to untangle the complexity of your environment. Start by listing all the assets under IT management, including all servers, storage devices, applications, data, network switches, access points and network appliances. Then record where each asset is physically located and which network it is on, and identify any dependencies between assets.
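Once dependencies are recorded, the inventory can tell you the order in which assets must be brought back. The sketch below is one possible way to model this, assuming each asset lists the names of the assets it depends on; the `Asset` fields and the asset names are hypothetical. It uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) so that every dependency appears before the assets that rely on it.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Asset:
    name: str
    kind: str        # e.g. "server", "storage", "switch" (hypothetical labels)
    location: str    # physical site, room or rack
    network: str     # network or VLAN the asset sits on
    depends_on: list[str] = field(default_factory=list)


def recovery_order(assets: list[Asset]) -> list[str]:
    """Return asset names ordered so every dependency precedes its dependents."""
    known = {a.name for a in assets}
    graph = {}
    for a in assets:
        missing = [d for d in a.depends_on if d not in known]
        if missing:
            # A dependency that is not in the inventory is a gap in the inventory.
            raise ValueError(f"{a.name} depends on unlisted assets: {missing}")
        graph[a.name] = set(a.depends_on)
    # static_order() raises graphlib.CycleError if dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical inventory for illustration only.
inventory = [
    Asset("web-01", "server", "Site A, rack 3", "dmz", ["db-01", "core-switch"]),
    Asset("db-01", "server", "Site A, rack 1", "internal", ["san-01"]),
    Asset("san-01", "storage", "Site A, rack 1", "storage", ["core-switch"]),
    Asset("core-switch", "switch", "Site A, comms room", "core"),
]
print(recovery_order(inventory))  # dependencies first, web-01 last
```

A circular dependency surfaced here is itself a useful finding: it usually means something in the environment cannot be cleanly recovered on its own.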
2. Perform a risk assessment
Once you have mapped out all your IT assets, networks and their dependencies, list the potential internal and external threats to each of those assets. Imagine the worst-case scenario — and be thorough. Threats could include natural disasters or mundane IT failures.
Next, estimate the probability of each threat occurring and the impact it could have. How would it affect the business if each scenario were to occur? Enlist the help of your business colleagues for this exercise, but be sure to emphasise that mundane events happen much more frequently than natural disasters. Move the conversation away from earthquakes and hurricanes and toward higher-probability events such as power outages or IT hardware failures.
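One common way to combine probability and impact is to multiply them into an expected annual loss and rank threats by that figure. The sketch below assumes you can put a rough annual probability and a cost on each scenario; all numbers are illustrative placeholders, not benchmarks. With figures like these, frequent mundane failures outrank a rare natural disaster.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    asset: str
    scenario: str
    annual_probability: float  # estimated chance of occurring in a given year (0 to 1)
    impact_cost: float         # estimated business cost if it does occur

    @property
    def expected_annual_loss(self) -> float:
        # Simple risk score: likelihood multiplied by impact.
        return self.annual_probability * self.impact_cost


def rank_threats(threats: list[Threat]) -> list[Threat]:
    """Highest expected annual loss first."""
    return sorted(threats, key=lambda t: t.expected_annual_loss, reverse=True)


# Illustrative placeholder figures only.
threats = [
    Threat("datacentre", "Earthquake", 0.002, 500_000),
    Threat("san-01", "Disk failure", 0.3, 15_000),
    Threat("core-switch", "Power outage", 0.5, 20_000),
]
for t in rank_threats(threats):
    print(f"{t.scenario:<14} {t.expected_annual_loss:>10,.0f}")
```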
3. Define criticality of applications and data
Before you begin to build out your business-oriented DR plan, you’ll need to classify your data and applications according to how critical they are to the business. Start by speaking to your business colleagues and support staff.
Look for commonalities and group applications and data according to their importance to the business, how often they change and their retention policy. You do not want to apply a different DR technique to every application or dataset you have; grouping assets with similar characteristics will allow you to implement a less complex strategy.
Classifying data in a vacuum based on assumptions could come back to haunt you. Be sure to involve other business managers and support staff in this exercise. You will undoubtedly have to make some trade-offs to limit the number of data classes you have. For medium-sized businesses, the number of classes should likely be between three and five.
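To illustrate how grouping keeps the number of classes small, here is a sketch that assigns each application to a tier from its criticality and rate of change. The tier names and rules are hypothetical stand-ins for whatever you agree with the business; retention policy could be added as a third attribute in the same way. The check at the end enforces the three-to-five class guideline.

```python
from collections import defaultdict


def tier_for(criticality: str, change_frequency: str) -> str:
    """Hypothetical rules. criticality: 'high' | 'medium' | 'low';
    change_frequency: 'continuous' | 'daily' | 'rarely'."""
    if criticality == "high":
        return "Tier 1" if change_frequency == "continuous" else "Tier 2"
    if criticality == "medium":
        return "Tier 3"
    return "Tier 4"


def group_by_tier(apps: dict[str, tuple[str, str]]) -> dict[str, list[str]]:
    """Group application names by the tier their attributes map to."""
    groups = defaultdict(list)
    for name, (criticality, change_frequency) in apps.items():
        groups[tier_for(criticality, change_frequency)].append(name)
    return dict(groups)


# Hypothetical applications and attributes.
apps = {
    "e-commerce database": ("high", "continuous"),
    "customer portal": ("high", "daily"),
    "email": ("medium", "continuous"),
    "legacy internal system": ("low", "rarely"),
}
groups = group_by_tier(apps)
assert 3 <= len(groups) <= 5, "revisit the classes with the business"
for tier in sorted(groups):
    print(tier, groups[tier])
```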
4. Define recovery objectives
Different classes of assets and data will have different recovery objectives. For instance, a critical e-commerce database may have very aggressive recovery objectives because the business simply can’t afford to lose any transactions or be down for long. On the other hand, a legacy internal system may have less stringent recovery objectives because the data involved doesn’t change very often and it’s less critical to get back online.
Many IT pros fall short at this step. Setting recovery objectives without consulting line-of-business managers is the No. 1 cause of misalignment, so it's imperative that you involve them in this process. The key is to ask stakeholders specific questions so that you understand business needs and can provide a level of service availability differentiated by business priority. Once you have that information in hand, translate it into recovery objectives for your DR plan.
Recovery Time Objective (RTO) is the maximum acceptable length of time that your data and production systems can be unavailable. To calculate the RTO for an application, consider how much revenue your organisation would lose if the application went down for a given length of time. For example, how much would you lose if your customer portal went down for an hour, or for a day? How much cost would be incurred if none of your employees could work because email was down?
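The two questions above can be turned into a rough cost-of-downtime estimate: revenue lost per hour plus the cost of staff who cannot work. This is a simplified sketch; the parameter names, the "loaded hourly cost" per employee and the productivity-loss fraction are assumptions, and all figures are placeholders. Comparing the cost at different outage lengths helps the business decide where the RTO should sit.

```python
def downtime_cost(hours: float, revenue_per_hour: float,
                  affected_employees: int = 0, loaded_hourly_cost: float = 0.0,
                  productivity_loss: float = 1.0) -> float:
    """Estimate the cost of an outage of the given length.

    productivity_loss is the fraction of affected employees' work that stops
    (1.0 = they cannot work at all). All inputs are business estimates.
    """
    lost_revenue = revenue_per_hour * hours
    idle_staff = affected_employees * loaded_hourly_cost * productivity_loss * hours
    return lost_revenue + idle_staff


# Placeholder figures: a customer portal earning 5,000 per hour...
for hours in (1, 24):
    print(f"Customer portal down {hours}h: {downtime_cost(hours, 5_000):,.0f}")

# ...and email being down for 200 staff who lose half their productivity.
print(f"Email down 8h: {downtime_cost(8, 0, affected_employees=200, loaded_hourly_cost=40, productivity_loss=0.5):,.0f}")
```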
Calculating your RTO is necessary to determine the features you need in your backup systems and products. For example, if you have a very high RTO (say, more than four hours), you will probably have time to restore from tape, but if you have a very low RTO (such as just a few minutes), you need host-based replication or disk-based backup with continuous data protection features.
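The rule of thumb above can be written as a simple lookup. Only the "more than four hours" threshold comes from the text; treating "a few minutes" as 15 minutes or less, and labelling the middle band "disk-based backup", are assumptions made for this sketch. Real product selection will depend on many more factors.

```python
from datetime import timedelta

TAPE_THRESHOLD = timedelta(hours=4)      # from the rule of thumb above
NEAR_ZERO_RTO = timedelta(minutes=15)    # assumption: "a few minutes"


def candidate_technique(rto: timedelta) -> str:
    """Suggest a class of protection technique for a given RTO (sketch only)."""
    if rto > TAPE_THRESHOLD:
        return "tape restore"
    if rto <= NEAR_ZERO_RTO:
        return "host-based replication or disk-based backup with CDP"
    # Assumption: between the two extremes, restoring from disk is the middle ground.
    return "disk-based backup"


for rto in (timedelta(hours=8), timedelta(hours=1), timedelta(minutes=5)):
    print(rto, "->", candidate_technique(rto))
```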
How much data can your organisation afford to lose? That is your Recovery Point Objective (RPO). If your organisation has a high tolerance for data loss, your RPO can be high, from hours to days. If your business can't afford to lose any data, or very little, your RPO will be measured in seconds. The RPO you set determines the minimum frequency for backing up your data: if you can only afford to lose an hour's worth of data, you should back up the data at least every hour. That way, if an outage begins at, for example, 2:30 p.m., you can retrieve the 2 p.m. backup and meet the RPO requirement.
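The 2:30 p.m. example can be checked mechanically. In the worst case, a failure strikes just before the next backup runs, so the backup interval must not exceed the RPO; the data actually at risk in a given outage is the time since the last completed backup. The sketch below assumes backups complete at the listed times; the date is arbitrary.

```python
from datetime import datetime, timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case: the outage hits just before the next backup, losing a full interval.
    return backup_interval <= rpo


def data_at_risk(outage_start: datetime, backup_times: list[datetime]) -> timedelta:
    """Time between the last completed backup and the start of the outage."""
    last = max((t for t in backup_times if t <= outage_start), default=None)
    if last is None:
        raise ValueError("no backup completed before the outage")
    return outage_start - last


# Hourly backups from 9 a.m. to 2 p.m. on an arbitrary day.
backups = [datetime(2024, 1, 15, hour, 0) for hour in range(9, 15)]
outage = datetime(2024, 1, 15, 14, 30)
lost = data_at_risk(outage, backups)
print(f"Data at risk: {lost}")
print("Hourly backups meet a one-hour RPO:", meets_rpo(timedelta(hours=1), timedelta(hours=1)))
```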
5. Determine the right tools and techniques
The good news is that numerous disaster recovery solutions are on the market today. Just make sure that what you choose offers the appropriate level of protection. Over-protection wastes money and introduces unnecessary complexity. Under-protection is obviously bad because it puts important business functions at risk.
For instance, nightly backups using traditional (file-based) methods are more than sufficient for low-impact data, but this method would be inappropriate for high-impact data and applications. A continuous data protection (CDP) solution is great for high-impact data and systems, but it can add overhead on production servers and increase storage costs.
Perhaps the most critical component of your DR plan is offsite protection; use it regardless of the type of backup method you choose. Offsite protection (be it a tape vaulting service or replication to the cloud) should be commensurate with your recovery objectives. Make sure your data is sent to a location far enough away that it is not in the same geographic risk zone. Typically, this means at least 25 miles from the primary location.
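If you know the coordinates of both sites, the straight-line separation is easy to check with the haversine great-circle formula. This is a sketch: the 25-mile default comes from the guideline above, the coordinates are approximate city-centre values used purely as an example, and distance alone does not prove the sites face different risks (both could share a flood plain or power grid).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def great_circle_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def far_enough(primary: tuple[float, float], offsite: tuple[float, float],
               minimum_miles: float = 25.0) -> bool:
    return great_circle_miles(*primary, *offsite) >= minimum_miles


# Example coordinates (approximate): London primary, Manchester offsite.
primary = (51.5074, -0.1278)
offsite = (53.4808, -2.2426)
print(f"{great_circle_miles(*primary, *offsite):.0f} miles apart;",
      "far enough" if far_enough(primary, offsite) else "too close")
```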
Finally, automate and streamline the recovery process as much as you can. In the event of a disaster, key IT staff may be unavailable. Automation also lessens the risk of human error.
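A scripted runbook is one way to automate recovery so that it does not depend on a particular person remembering each step. The sketch below runs named steps in order, logs each one and halts at the first failure so that nobody proceeds on a broken foundation. The step names and actions are hypothetical placeholders; in practice each action would call your own recovery tooling.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("dr-runbook")

# Each step is a (name, action) pair; the action returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[tuple[str, bool]]:
    """Run steps in order, stopping at the first failure. Returns per-step results."""
    results = []
    for name, action in steps:
        log.info("Starting step: %s", name)
        try:
            ok = bool(action())
        except Exception:
            log.exception("Step raised an error: %s", name)
            ok = False
        results.append((name, ok))
        if not ok:
            log.error("Halting runbook at failed step: %s", name)
            break
        log.info("Completed step: %s", name)
    return results


# Hypothetical placeholder actions.
steps = [
    ("Bring up core network", lambda: True),
    ("Mount replicated storage", lambda: True),
    ("Start database", lambda: True),
    ("Start customer portal", lambda: True),
]
print(run_runbook(steps))
```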
6. Get stakeholder buy-in
Go beyond the walls of the datacentre and involve key stakeholders from all your business units. They need to be involved in the planning phase, and they should agree with you on the company's priorities as well as the service-level agreements (SLAs) your team will provide.
Also, consult your strategic partners and vendors to make sure you’re getting the most out of your DR solution or services. Once you have consulted all of the key stakeholders, enlist an executive-level sponsor who will get behind you and the project. The importance of collaboration, consensus and executive support to your DR plan’s success cannot be emphasised enough.
7. Document and communicate your plan
In a disaster scenario, you need a documented strategy for how to get back to a working state. This document should be written for the people who will use it.
Importantly, however, this documented plan must be communicated. All too often, only one person in the organisation really knows the whole picture, leaving the organisation vulnerable if that person is unavailable during a disaster. In addition, be sure to store your recovery strategy where it can be accessed during a disaster, not on a public share in your Exchange folders, which may be unavailable in the very outage you are recovering from. Ideally, it should be printed and posted in multiple locations.
8. Test and practice your DR plan
People often say, “Practice makes perfect.” A better saying might be, “Practice makes progress.” No organisation ever gets to perfection with its DR plan, but practice will help you find and rectify problems in your plan, as well as enable you to execute it faster and more accurately. Make sure that everyone who has a role to play attends the practice sessions, although you do not need to practice executing the full disaster recovery plan every time. It’s perfectly acceptable to carve out pieces of your plan to test.
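Testing pieces of the plan works best when the pieces rotate, so that every part is exercised over a full cycle and the right people are in the room each time. The sketch below spreads plan components round-robin across a fixed number of sessions (four quarterly sessions here, an assumption) and collects the roles each session needs. The component and role names are hypothetical.

```python
def plan_test_sessions(components: dict[str, set[str]], sessions: int) -> list[dict]:
    """Assign each plan component to one session, round-robin, and gather attendees."""
    if sessions < 1:
        raise ValueError("need at least one session")
    plan = [{"components": [], "attendees": set()} for _ in range(sessions)]
    for i, (name, roles) in enumerate(components.items()):
        slot = plan[i % sessions]
        slot["components"].append(name)
        slot["attendees"] |= roles  # everyone with a role in this component attends
    return plan


# Hypothetical plan components and the roles involved in each.
components = {
    "Restore e-commerce database": {"DBA", "storage admin"},
    "Fail over customer portal": {"web admin", "network admin"},
    "Recover email": {"messaging admin"},
    "Recall offsite tapes": {"backup operator", "facilities"},
    "Notify stakeholders": {"DR coordinator", "communications"},
}
for quarter, session in enumerate(plan_test_sessions(components, sessions=4), start=1):
    print(f"Q{quarter}: {session['components']} -> {sorted(session['attendees'])}")
```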
9. Evaluate and update your plan
A DR plan should be a living document, and it's especially important to review it regularly given the shifting sands of an ever-changing business environment. Tolerance for downtime and data loss may decline, and key personnel may take extended leave or leave the organisation. IT might migrate to new hardware or operating systems, or the company might acquire another company. Because your organisation is constantly changing, your plan needs to reflect its current state, whatever that may be.
Adrian Moir, senior consultant, product management, Quest