How to survive the next Azure outage


On September 4, 2018, the South Central US Region of Microsoft’s Azure cloud experienced a catastrophic failure that knocked out an entire datacentre, leaving some customers offline for more than two days. The forensic analysis revealed that a severe thunderstorm had triggered a cascading series of problems, beginning with the failure of a redundant chiller and ending in physical damage when some systems overheated.

Stuff happens. Failures are inevitable. But here is the untold story from that day: Those customers who had implemented their own robust disaster recovery and/or high-availability provisions, whether within or atop the Azure cloud infrastructure, were barely affected by either downtime or data loss during this major outage.

This article examines four options for providing disaster recovery (DR) and high availability (HA) protections for applications running in hybrid and purely public cloud configurations using Azure. The focus here is on Microsoft SQL Server because it is a popular Azure application that also has its own HA and DR provisions, but two of the options also support other applications. The four options, which can also be used in various combinations, include:

  • the Azure Site Recovery (ASR) Service
  • SQL Server Failover Cluster Instances with Storage Spaces Direct
  • SQL Server Always On Availability Groups
  • Third-party Failover Clustering Software

Before discussing these options, it is helpful to understand some availability-related aspects of the Azure cloud within sites, within regions and across multiple regions. During what Microsoft calls the “South Central US Incident,” many Azure customers were surprised to find that having servers in different Availability Sets distributed across different Fault Domains offered no protection against an outage affecting an entire datacentre. The reason is that, while each Fault Domain resides in a different rack, the racks in an Availability Set are all in the same datacentre. Such configurations do afford some HA protections (for example, from a server failing), but they provide neither HA nor DR protection during a site-wide failure.

For protection from a single site-wide failure, Azure is rolling out Availability Zones (AZs). Each Region that supports AZs has at least three datacentres that are inter-connected with sufficiently high bandwidth and low latency to support synchronous replication. Azure provides a 99.99 per cent uptime guarantee for configurations using AZs, but caveat emptor: the guarantee excludes many common causes of downtime, including failures in customer and third-party software, and what might be called “user error”, the inevitable mistakes occasionally made by all administrators. AZs are nevertheless an effective means of maximising uptime in some Azure configurations, and had they been available and implemented properly during the South Central US Incident, they would have enabled a rapid recovery.

For even greater resiliency, Azure offers Region Pairs. Every region is paired with another within the same geography (such as US, Europe or Asia) separated by at least 300 miles. The pairing is strategically chosen to protect against widespread power or network outages, or major natural disasters. Microsoft also takes advantage of the arrangement to roll out planned updates to each pair, one region at a time.
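The pairing can be treated as a symmetric lookup. The sketch below uses a small illustrative subset of Microsoft’s published pairings; verify the full list against current Azure documentation before relying on it:

```python
# Illustrative subset of Azure Region Pairs; pairings are symmetric,
# so each region's pair maps back to it.
REGION_PAIRS = {
    "eastus": "westus",
    "northeurope": "westeurope",
    "southcentralus": "northcentralus",
}

def paired_region(region):
    """Return the paired region, or None if the region is not listed."""
    reverse = {v: k for k, v in REGION_PAIRS.items()}
    return REGION_PAIRS.get(region) or reverse.get(region)

print(paired_region("southcentralus"))  # northcentralus
print(paired_region("westeurope"))      # northeurope
```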

The four options discussed here are able to leverage these availability-related aspects of the Azure cloud to deliver the different levels of HA and DR protections needed by the full spectrum of enterprise applications.

Azure site recovery (ASR) service

ASR is Azure’s DR-as-a-service (DRaaS) offering. With ASR, physical servers, virtual machines and Azure cloud instances are replicated to another Azure Region or from on-premises instances to the Azure cloud, ideally in a distant region. The service delivers a reasonably rapid recovery from system and site outages, and can be tested in an easy, non-disruptive manner to ensure failovers will not fail when actually needed.

Like all DRaaS offerings, ASR has some limitations. For example, WAN bandwidth consumption cannot exceed 10 megabytes per second, which may be too low for high-use applications. A more serious limitation is the inability to automatically detect, and rapidly fail over from, many failures that cause application-level downtime. Of course, this is why the service is characterised as being for disaster recovery and not for high availability.

Even with these limitations, ASR provides a capable and cost-effective DR solution for many enterprise applications. The service replicates the entire VM and enables reverting to a prior snapshot. Runbooks can be used to automate the sequential steps in the recovery to prevent operator errors. The recovery process must be activated manually, however, because ASR does not monitor for failures or initiate any failovers.
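The value of scripted, ordered recovery can be illustrated with a short sketch. The step names below are hypothetical, and the real service executes Azure Automation runbooks rather than Python, but the principle of sequential, fail-fast steps is the same:

```python
# Minimal sketch of a runbook-style recovery sequence (hypothetical steps,
# not the ASR API): steps run strictly in order, and any exception halts
# the run, mirroring how runbooks prevent out-of-order operator actions.
def run_recovery(steps, log):
    for name, action in steps:
        log.append(f"start {name}")
        action()
        log.append(f"done {name}")

log = []
steps = [
    ("promote-replica", lambda: None),   # placeholder actions
    ("remap-dns", lambda: None),
    ("restart-app-tier", lambda: None),
]
run_recovery(steps, log)
print(log[-1])  # done restart-app-tier
```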

The two metrics normally used to assess HA and DR provisions are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). RTO is the maximum tolerable duration of an outage, while RPO is the maximum tolerable period of data loss. ASR can accommodate an RTO as low as 3-4 minutes depending, of course, on how quickly administrators are able to detect a problem and respond. RPOs vary greatly depending on the application’s rate of change. ASR can accommodate RPOs measured in minutes, but for high-use applications that require minimal or no data loss (an RPO close to zero), a more robust DR solution is needed.
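As a worked example of these two metrics, the following sketch (with made-up timestamps) checks whether a single recovery met a given RTO and RPO:

```python
from datetime import datetime, timedelta

def assess_recovery(failure_time, recovery_time, last_replicated, rto, rpo):
    """Return (rto_met, rpo_met) for one outage.

    RTO: elapsed time from failure to restored service.
    RPO: age of the newest surviving data, measured at the failure.
    """
    actual_rto = recovery_time - failure_time
    actual_rpo = failure_time - last_replicated
    return actual_rto <= rto, actual_rpo <= rpo

# Hypothetical outage: recovery in 4 minutes, last replica 90 seconds old.
failure = datetime(2018, 9, 4, 9, 29)
rto_met, rpo_met = assess_recovery(
    failure_time=failure,
    recovery_time=failure + timedelta(minutes=4),
    last_replicated=failure - timedelta(seconds=90),
    rto=timedelta(minutes=5),
    rpo=timedelta(minutes=1),
)
print(rto_met, rpo_met)  # True False: RTO satisfied, 1-minute RPO missed
```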

SQL server failover cluster instances with storage spaces direct

Many commercial and open source software offerings provide their own, sometimes optional HA/DR capabilities, and SQL Server offers two such features: Failover Cluster Instances (discussed here) and Always On Availability Groups (discussed in the next section).

The use of FCIs (available since SQL Server 7) affords three major advantages: it is available with SQL Server Standard Edition; it protects the entire SQL Server instance, including the system databases; and it imposes no limitations on the Microsoft Distributed Transaction Coordinator (DTC). A major disadvantage for HA and DR needs has been its requirement for cluster-aware shared storage, which has traditionally not been available in public cloud services.

A popular choice for SQL Server FCI storage in the Azure cloud is Storage Spaces Direct (S2D), which was introduced in Windows Server 2016 with concurrent support in SQL Server 2016. S2D is software-defined storage that creates a virtual storage area network. It can be used in configurations with two FCI nodes in the Standard Edition and with three (or more) nodes in the Enterprise Edition.

A major disadvantage of S2D is that the servers must reside within a single datacentre. Put another way: the configuration is not compatible with Availability Zones, geo-clusters or the Azure Site Recovery service. As single-site HA protection, the combination of FCIs and S2D is viable. For multi-site HA and DR protections, data replication must be provided by log shipping or a third-party failover clustering solution.
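The practical consequence of relying on log shipping is a worst-case RPO roughly equal to the backup interval plus the copy and restore time. A back-of-envelope sketch, with illustrative timings:

```python
# Worst-case data loss for log shipping: a failure just before the next
# log backup loses up to one full backup interval, plus the time needed
# to copy and restore the most recent log on the secondary.
def log_shipping_rpo_seconds(backup_interval_s, copy_s, restore_s):
    return backup_interval_s + copy_s + restore_s

# 5-minute log backups, 60 s to copy, 30 s to restore:
print(log_shipping_rpo_seconds(300, 60, 30))  # 390 seconds
```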

SQL server always on availability groups

Always On Availability Groups is SQL Server’s most capable offering for HA and DR. First released in SQL Server 2012, the feature is available only in the more expensive Enterprise Edition. Among its advantages are being able to accommodate an RTO of 5-10 seconds and an RPO requiring minimal to no data loss, a choice of synchronous or asynchronous replication, and readable secondaries for querying the databases (with appropriate licensing). The Enterprise Edition of SQL Server also places no limits on the size of the database and permits HA/DR configurations with three nodes.
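The choice between synchronous and asynchronous replication is essentially a trade between commit latency and RPO. The toy model below (not SQL Server internals) captures the distinction:

```python
# Toy model of the commit-mode trade-off: a synchronous commit is
# acknowledged only after the secondary hardens the transaction, so a
# primary failure loses no acknowledged work; an asynchronous commit
# acknowledges immediately and ships the log later.
def commit(txn, secondary, pending, synchronous):
    if synchronous:
        secondary.append(txn)        # hardened on the secondary before ack
    else:
        pending.append(txn)          # acked now, replicated later
    return "acknowledged"

secondary, pending = [], []
commit("txn-1", secondary, pending, synchronous=True)
commit("txn-2", secondary, pending, synchronous=False)
# If the primary fails here, only the asynchronous txn-2 is at risk:
print(secondary, pending)  # ['txn-1'] ['txn-2']
```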

One popular configuration that affords robust HA and DR protections is a three-node arrangement with two nodes in a single Availability Set or Zone, and the third in a separate Region, preferably as part of a Region Pair. One notable limitation is that Always On Availability Groups replicate only the user database(s), not the entire SQL Server instance with its system databases. This is why configurations like these often employ third-party failover clustering software for a more complete HA/DR solution.
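A three-node arrangement like this typically relies on majority quorum to avoid split-brain: the cluster keeps running only while a strict majority of voting nodes is reachable. A minimal sketch of the voting rule:

```python
# Majority quorum: with three voting nodes, losing any single node (or the
# remote region) leaves two votes, so the cluster stays up; losing two
# nodes halts it rather than risk a split-brain.
def has_quorum(total_votes, reachable_votes):
    return reachable_votes > total_votes // 2

print(has_quorum(3, 2))  # True: one node lost, cluster stays up
print(has_quorum(3, 1))  # False: two nodes lost, cluster halts
```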

In addition to the higher licensing fee for the Enterprise Edition, which can be cost-prohibitive for some database applications, this approach has another disadvantage. Because it works only for SQL Server, IT departments need to implement other HA and DR provisions for all other applications. The use of multiple, application-specific HA/DR solutions increases complexity and costs (for licensing, training, implementation and ongoing operations), which is another reason why many organisations prefer using a “universal” third-party solution for failover clustering.

Third-party failover clustering software

The major advantages of third-party failover clustering software derive from its application-agnostic and platform-agnostic design. This enables the software to provide a complete HA and DR solution for virtually all applications in private, public and hybrid cloud environments, as well as for both Windows and Linux.

As complete solutions, the software includes, at a minimum, real-time data replication, continuous monitoring capable of detecting any failure at the application level, and configurable policies for failover and failback. Most solutions also offer additional advanced capabilities that frequently include a choice of synchronous or asynchronous replication, WAN optimisation to maximise performance, and manual switchover of primary and secondary assignments for performing planned maintenance and routine backups without disrupting the application.
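The continuous-monitoring behaviour can be sketched as a simple heartbeat policy; the threshold below is an illustrative default, not any particular vendor’s:

```python
# Declare an application failed (and eligible for failover) only after
# several consecutive missed health checks, so that one dropped heartbeat
# does not trigger an unnecessary failover.
def should_fail_over(heartbeats, threshold=3):
    """heartbeats: list of booleans, newest last; True = check passed."""
    missed = 0
    for ok in reversed(heartbeats):
        if ok:
            break
        missed += 1
    return missed >= threshold

print(should_fail_over([True, False, False, False]))  # True
print(should_fail_over([True, True, False, False]))   # False
```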

Being application-agnostic eliminates the problems caused by having different HA/DR provisions for different applications. Being platform-agnostic makes it possible to leverage various capabilities and services in the cloud, including Azure’s Fault Domains, Availability Sets and Zones, Region Pairs and Azure Site Recovery.

Other advantages include satisfying RTOs as low as 20 seconds and RPOs of minimal to no data loss, and the ability to protect the entire SQL Server instance with FCIs in the less expensive Standard Edition. Two notable disadvantages are the inability to read secondary instances of databases, and the additional cost of implementing and maintaining a separate HA/DR solution atop the Azure cloud. But given the inability of Azure and other clouds to detect common causes of failure at the application level, having a separate solution is necessary when running mission-critical applications.

Comparing the options

The table provides a summary, side-by-side comparison of all four options. It is important to note that these options are not mutually exclusive; that is, they can be used in various combinations to achieve the most cost-effective HA and/or DR protection needed.

For example, for database applications that are not mission-critical, SQL Server FCI with S2D can be used for (single-site) HA, and Azure Site Recovery can be used for DR. For the most critical database applications, a combination of third-party failover clustering software and Always On Availability Groups makes it possible to create a three-node configuration (with readable secondaries) capable of failing over automatically and almost instantaneously from virtually any outage of any extent anywhere in the cloud, whether purely public or hybrid.

In this summary, side-by-side comparison, the darker the dot, the better the feature is supported, with the black one indicating robust support and the transparent one indicating the feature is unsupported.

To survive the next Azure outage, including one like the South Central US Incident, make certain that whatever high-availability and/or disaster recovery provisions you choose are configured with at least two nodes spread across two regions, preferably in a Region Pair. Also be sure to understand how well recovery time and point objectives are satisfied, and be aware of the limitations, including the need for any manual processes required to detect all possible failures and trigger failovers in ways that ensure both application continuity and data integrity.

Jonathan Meltzer, Director, Product Management, SIOS Technology