
The real-world challenges of cloud-based DR orchestration

(Image credit: alphaspirit / Shutterstock)

Legacy storage protection architectures rely on tiers of specialised primary and secondary storage appliances and accompanying backup software. In many scenarios, disaster recovery (DR) needs can only be addressed through separate, dedicated DR orchestration software. These architectures evolved during the client-server era, and they present RPO/RTO, resource utilisation and risk mitigation challenges for modern hybrid cloud environments. Exploring these challenges makes clear that these siloed products – sourced from multiple vendors and carrying inherent operational complexity – need integration. In this article, we take a closer look at the challenges organisations face and the advantages of cloud-based DR orchestration services.

RPO/RTO and compute resource challenges

Traditional backup software delivers 24-hour recovery point objectives (RPOs) and recovery time objectives (RTOs). Because these processes demand heavy compute resources, backup is generally run once a day during off-hours. While this may be adequate for some backup scenarios, DR generally carries more demanding RPO and RTO requirements. Given the associated performance bottlenecks, and the impact of backups on production workload execution, application owners need better SLAs for DR – a need that simply can’t be met with legacy backup software, forcing administrators to run other dedicated DR solutions in parallel.

Alternative methods that use fewer resources and have a smaller impact on production workloads – such as array-based LUN or volume mirroring – can run on a more aggressive schedule, for instance every 30 minutes. However, this technique lacks the backup capability needed to satisfy regulatory and operational requirements for data protection: it offers no extended backup storage, no backup catalogue, and no individual VM or file recovery.

Another alternative is to synchronously replicate every write from the primary to a secondary site. This imposes rigid requirements on inter-site network latency, however, along with high network bandwidth demands and a need for DR orchestration software to coordinate recovery on the secondary site. Continuous Data Protection solutions can help by providing high levels of data protection for a few carefully chosen workloads, yet they are seldom used as a complete DR solution and don’t eliminate the need for dedicated backup storage appliances.
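The trade-off between these approaches is easiest to see as arithmetic. The following sketch, with illustrative numbers rather than figures from any specific product, compares the worst-case data loss window of a daily backup run against 30-minute array mirroring:

```python
from datetime import timedelta

def worst_case_rpo(interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Worst-case data loss window for periodic replication: a failure
    just before the next cycle completes loses one full interval of
    changes plus whatever was in flight during the transfer."""
    return interval + transfer_time

# Illustrative figures: a nightly backup with a 2-hour window versus
# array mirroring every 30 minutes with a 5-minute transfer.
daily = worst_case_rpo(timedelta(hours=24), timedelta(hours=2))
mirror = worst_case_rpo(timedelta(minutes=30), timedelta(minutes=5))
print(daily)   # 1 day, 2:00:00
print(mirror)  # 0:35:00
```

The mirroring schedule improves the worst-case RPO by roughly a factor of 45 in this sketch, which is why application owners push for it despite its weaker backup semantics.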

Juggling multiple products – too complex, too inefficient

The number of data transfers carried out by traditional data protection architectures, integrating best-of-breed backup and DR products, is mind-boggling. The data protection part alone involves five different data transfers with the majority requiring I/O-intensive data transformations. On top of that, restoration from backups or a DR failover involves a number of additional data transformations and data transfers.

Let’s break down the process: backup software keeps its data on a specialised backup appliance. As a part of the backup process, the software copies recent changes from the primary storage array to the backup appliance. But primary storage and backup appliances have their own different filesystems. And backup software normally utilises its own client file system, layered on top of the backup array, managing snapshots of protected entities.
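The copy chain described above can be modelled as a list of hops, each rewriting or indexing the data. The stage names below are generic illustrations, not any vendor's terminology:

```python
# Illustrative model of the copy chain in a legacy backup architecture.
# Each hop that rewrites data (rather than merely indexing it) is an
# I/O-intensive transformation with its own corruption risk.
PIPELINE = [
    ("primary array filesystem",   "read changed blocks"),
    ("backup client filesystem",   "repackage into backup format"),
    ("backup appliance filesystem", "dedupe and compress on ingest"),
    ("backup catalogue",           "index snapshots of protected entities"),
    ("offsite/DR copy",            "replicate transformed backup data"),
]

def count_transformations(pipeline):
    """Count the hops that rewrite data rather than index it."""
    return sum(1 for _, action in pipeline if "index" not in action)

print(count_transformations(PIPELINE))  # 4
```

Each of these transformations happens in a different vendor's format, which is what makes end-to-end verification so difficult in this architecture.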

With all of this come data integrity risks…

In the end, administrators are left with a complex web of solutions integrating components from three or more vendors. This leads to increased complexity, ample opportunity for misconfiguration, and staggering resource inefficiency due to multiple rounds of data transformation with no end-to-end integrity checks. Studies have shown that a business using three or more vendors for data protection loses three times more data than one using a single vendor. What’s more, DR orchestration software that relies on third-party storage and replication has little chance of detecting in-transit data corruption caused by misconfiguration or a software or hardware fault. There are no global end-to-end data integrity checks or APIs that can be applied across multiple hardware and software products from different vendors.
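What an end-to-end integrity check buys is worth making concrete. A minimal sketch, using a standard SHA-256 content hash rather than any particular product's mechanism: fingerprint the data at the source, and verify the restored bytes against that fingerprint no matter how many transformations happened in between.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash taken at the source, before any transformation."""
    return hashlib.sha256(data).hexdigest()

def verify_restore(source_digest: str, restored: bytes) -> bool:
    """After replication, dedupe, compression and restore, the
    reconstructed bytes must hash back to the original digest."""
    return hashlib.sha256(restored).hexdigest() == source_digest

original = b"VM disk contents"
digest = fingerprint(original)

# A silent bit-flip anywhere in the pipeline is caught at restore time.
print(verify_restore(digest, original))             # True
print(verify_restore(digest, b"VM disk corrupted"))  # False
```

In a multi-vendor stack there is no single component positioned to hold the source digest and check it at the far end, which is precisely the gap described above.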

What organisations should look for in a cloud-based DR solution

Industry progress is rapid, with vendors constantly building on one another’s breakthroughs. Amid such progress, organisations should look for a solution that addresses all aspects of DR. A solution that is simpler and significantly less resource-intensive can bring lower RPOs and RTOs to cloud and on-premises environments alike. Cloud-based disaster recovery orchestration services have recently emerged as a real alternative, providing end-to-end orchestration for workload protection, backup and replication to cloud or other on-premises sites, DR plan definition, workflow execution, testing, compliance checks and report generation. But how do organisations go about choosing the right solution?

The following provides a checklist of capabilities to look for:

  1. Integration: Consolidating all aspects of backup and DR into a single, centrally managed system removes the need to navigate a web of management consoles, and cuts the excessive resource usage that comes from multiple data copies and expensive data transformations. It also eliminates the need to run parallel hardware and software stacks for backup and DR.
  2. Storage-level snapshots: Backup via native storage-level snapshots provides RPOs measured in minutes, not hours or days. Because the snapshots are taken at the storage level, it is possible to achieve consistent point-in-time backups across many VMs executing on different servers – functionality not available from third-party backup software that relies on hypervisor APIs to take snapshots and copy snapshot state into backups.
  3. Backup accessibility: Backups should be accessible via a searchable catalogue and kept on a cost-effective medium.
  4. Single management console: The ideal solution should offer a single management console in order to establish backup and replication policies that operate on exactly the same abstractions.
  5. Health checks and compliance: Built-in health checks pinpoint problems anywhere in the backup and DR stack – for example, replication failures caused by lost network connectivity would flag all affected DR plans. A system that performs compliance checks to ensure that changes in the execution environment do not invalidate DR plans is also beneficial.
  6. Integrity checks: End-to-end data integrity checking mitigates the risks associated with multiple data transformations and misconfigurations, regardless of data location or past replication history.
  7. Just-in-time deployment: Optimising the costs of DR is an important TCO consideration. Just-in-time deployment of a cloud DR site presents an attractive alternative to continuously maintaining a warm stand-by cloud DR site. With just-in-time deployment, the recurring costs of a cloud DR site are eliminated in their entirety until a failover occurs and cloud resources are provisioned.
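The compliance checks in item 5 can be sketched in a few lines. This is an illustrative model only – the plan structure, names and one-hour lag threshold are assumptions, not any product's API: a plan is flagged if a protected workload has disappeared from the inventory or replication has stalled.

```python
from datetime import datetime, timedelta, timezone

def compliance_issues(plan, inventory, last_replication,
                      max_lag=timedelta(hours=1)):
    """Flag conditions that would invalidate a DR plan: missing
    protected VMs, or replication lag beyond the allowed window."""
    issues = []
    for vm in plan["vms"]:
        if vm not in inventory:
            issues.append(f"{plan['name']}: protected VM '{vm}' no longer exists")
    lag = datetime.now(timezone.utc) - last_replication
    if lag > max_lag:
        issues.append(f"{plan['name']}: replication lag {lag} exceeds {max_lag}")
    return issues

# app01 was decommissioned without updating the DR plan.
plan = {"name": "payroll-dr", "vms": ["db01", "app01"]}
inventory = {"db01"}
print(compliance_issues(plan, inventory, datetime.now(timezone.utc)))
```

Run continuously rather than at failover time, checks like this turn a stale DR plan from a disaster-day surprise into a routine alert.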

Sazzala Reddy, Datrium

Prior to Datrium, Sazzala Reddy was at Data Domain where he worked on building a distributed dedupe file system. He was the CTO of Data Domain after EMC’s acquisition. He has an MS in Nuclear Engineering and an MS in Computer Science, both from the University of Michigan, Ann Arbor.