Failure is an option in IT. In fact, failure is an ever-present option, and some can be whoppers. Consider, for example, when the South Central US Region of Microsoft’s Azure cloud experienced a catastrophic failure. It began with a thunderstorm that set off a cascading series of failures, culminating in an entire datacentre going down. Some customers without IT resiliency provisions experienced over two days of downtime.
Microsoft is addressing the many issues that led to the outage, but the incident serves to remind IT professionals of the importance of resiliency in the cloud. There are, of course, ways to survive virtually any outage anywhere at any time—from a single server failing to an entire datacentre going offline. The technologies needed are proven in practice, and there are numerous options available to meet different needs and budgets. With so many options, the real challenge is making the optimal choice for each and every application.
This article provides an overview of achieving IT resiliency for SQL Server with the three major cloud service providers.
The six imperatives of IT resiliency
Deloitte Consulting LLP uses a business continuity framework that consists of these six fundamental imperatives of IT resilience:
- Defend – The first imperative involves taking steps to reduce the probability of system failures or of downtime that can be caused by any single points of failure. All potential failure scenarios should be considered, especially for mission-critical applications.
- Detect – Continuity cannot occur without first knowing an outage has occurred, and the sooner the better with continuous monitoring capable of automatically detecting failures.
- Remediate – Once an outage is detected, either an automatic or manual response is needed to ensure that IT can somehow continue providing essential services.
- Recover – For critical applications, recovery should be expected to occur rapidly and fully (ideally automatically), and with little or no data loss. A full recovery requires that services be returned to pre-disruption levels, including for throughput performance.
- Diagnose – This post-incident or post-mortem imperative involves identifying the root cause(s) of the disruption. It is also possible to do this as “thought experiments” conducted on a continuous basis to identify potential issues.
- Refine – Lessons learned during and after disruptions should be catalysts for making any improvements that might be needed in the defence, detection, remediation and recovery imperatives. In effect, this is the feedback loop that underlies the IT resilience framework by making it possible to continuously refine business continuity plans.
Defending against downtime that might be caused by failures requires implementing some form of high availability (HA) and/or disaster recovery (DR) provisions for all applications, and these provisions encompass the Detect, Remediate and Recover imperatives. The remaining two imperatives (Diagnose and Refine) are used post-incident to inform any changes that might be needed in the HA and/or DR protections.
The requirements for Detect, Remediate and Recover are usually determined by each application’s Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Applications that need to recover quickly with little or no data loss have stringent RTOs and RPOs, and therefore need robust HA protection with automatic failover. Applications that can tolerate some downtime and data loss may require only DR protection, which generally involves manual processes to Detect, Remediate and Recover from failures and widespread disasters. Depending on the HA provisions, protection against widespread disasters may also require separate DR provisions.
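As a rough illustration of how recovery objectives drive the choice of protection, the mapping can be sketched as a simple decision function. The threshold values below are illustrative assumptions, not figures from any framework:

```python
# Sketch: choosing a protection tier from recovery objectives.
# The RTO/RPO thresholds are hypothetical, for illustration only.

def protection_tier(rto_seconds: float, rpo_seconds: float) -> str:
    """Suggest HA or DR protection based on RTO/RPO stringency."""
    if rto_seconds <= 60 and rpo_seconds == 0:
        return "HA with automatic failover (plus separate DR if needed)"
    if rto_seconds <= 3600:
        return "HA or warm-standby DR"
    return "DR with manual recovery"

print(protection_tier(30, 0))        # a mission-critical database
print(protection_tier(86400, 3600))  # a batch reporting system
```

In practice the tiers and thresholds would come from the business continuity plan, but the principle is the same: the tighter the RTO and RPO, the more automation the Detect, Remediate and Recover steps require.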
Resiliency options available in the cloud
All clouds, including Microsoft Azure, Amazon Web Services (AWS) and the Google Cloud Platform (GCP), have state-of-the-art infrastructures with various standard and optional ways to maximise the availability of services. All cloud service providers (CSPs) also have service level agreements (SLAs) that offer money-back guarantees for uptime falling short of a designated percentage, usually ranging from 99.0 per cent to 99.99 per cent. Four nines of uptime (99.99 per cent) is generally accepted as constituting HA, and configurations eligible for these 99.99 per cent SLAs must span multiple availability zones or regions to protect against failures at the datacentre level.
But be forewarned: the SLAs only guarantee “dial tone” at the server level, and explicitly exclude many causes of downtime at the application level. So while it is advantageous to leverage the CSP’s infrastructure, additional HA provisions are needed to ensure the RTOs and RPOs of mission-critical SQL Server databases are met.
For DR needs, all three CSPs have offerings suitable for most applications. GCP has what could be called DIY (Do-It-Yourself) DR guided by templates, cookbooks and other tools. DIY is a viable option because, compared to HA, DR is relatively easy to implement with data snapshots or replication and “warm” standby instances, all of which are available in every cloud. Microsoft and AWS have managed DR-as-a-Service (DRaaS) offerings: Azure Site Recovery (ASR) and CloudEndure Disaster Recovery, respectively. For all three CSPs it is important to note that at least some manual intervention is required to Detect, Remediate and/or Recover from a failure.
Resiliency options available with SQL Server
For SQL Server itself, administrators have a choice of two options for providing HA/DR protection at the application level: Failover Cluster Instances and Always On Availability Groups.
Failover Cluster Instances have been available since SQL Server 7 and afford two notable advantages: FCIs are available in the less expensive Standard Edition; and they protect the entire SQL Server instance, including user and system databases. A significant disadvantage for HA and/or DR is the need for cluster-aware shared storage, such as a storage area network (SAN), to create a common dataset for the active and standby instances. But shared storage has historically been unavailable in the cloud.
The lack of shared storage was addressed in Windows Server 2016 Datacenter Edition with the introduction of Storage Spaces Direct (S2D), accompanied by support for S2D in SQL Server 2016. S2D enables creating a virtual SAN where data can be shared among multiple instances. But it requires that the active and standby instances reside within the same datacentre, making this option viable for some HA needs, but not for multi-region DR configurations or for HA configurations where nodes reside in different Availability Zones. FCIs can still be used for DR purposes, but data replication across the WAN will need to be provided by a third-party replication solution.
Always On Availability Groups is SQL Server’s more robust HA/DR offering, and is capable of satisfying an RTO of 5-10 seconds and an RPO of seconds or less. This feature replaced database mirroring in SQL Server 2012 and is included in SQL Server 2017 for Linux. Its advantages include no need for shared storage and readable secondaries for querying the databases (with appropriate licensing).
Among its disadvantages are the need to license the more expensive Enterprise Edition, which is cost-prohibitive for many applications, and its lack of protection for the entire SQL instance. While a Basic Availability Groups feature was added to the Standard Edition of SQL Server 2016, it supports only a single database per AG. For Linux, which lacks the equivalent of Windows Server Failover Clustering, there is a need for additional open source software or a third-party failover clustering solution.
Third-party SANless failover clusters
This option is purpose-built to provide a complete HA/DR solution for virtually all Windows and Linux applications in private, public and hybrid cloud environments. Being application-agnostic eliminates the need to have different HA/DR solutions for different applications. Being SANless overcomes impediments caused by the lack of shared storage in the cloud, while making it possible to leverage the cloud’s many resiliency-related capabilities, including availability zones and regions.
These software-only solutions include, at a minimum, real-time data replication, continuous monitoring able to detect failures at the application level, and configurable policies for failover and failback. Most also offer a variety of value-added capabilities, including some specific to Linux and popular applications like SQL Server.
Being purpose-built for HA, failover clusters are able to satisfy RTOs as low as 20 seconds with no data loss (an RPO of zero), enabling them to deliver an uptime of 99.99 per cent. One notable disadvantage is the inability to read secondary instances of SQL Server databases. But given their capacity to Detect, Remediate and Recover from failures at the application level, purpose-built SANless failover clustering is becoming an increasingly popular option for critical database and other applications running in the cloud.
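To make the Detect, Remediate and Recover cycle concrete, the core of such a clustering solution can be caricatured as a health-check loop with a configurable failure threshold. This is a simplified sketch under assumed names (the probe function and node identifiers are invented for illustration), not the design of any particular product:

```python
# Simplified sketch of a failover monitor: probe the active node,
# tolerate transient misses, and promote a standby after repeated failures.
# The probe callable and node names are illustrative assumptions.

from typing import Callable, List

def monitor(probe: Callable[[str], bool], nodes: List[str],
            failure_threshold: int = 3) -> str:
    """Return the node serving traffic after one monitoring pass."""
    active, *standbys = nodes
    misses = 0
    while misses < failure_threshold:
        if probe(active):
            return active        # Detect: the active node is healthy
        misses += 1              # tolerate transient probe failures
    # Remediate/Recover: promote the first standby to active
    return standbys[0]

# Usage: a probe that always fails triggers failover to the standby.
print(monitor(lambda node: False, ["node-a", "node-b"]))
```

Real products layer replication, quorum arbitration and application-aware checks on top of this loop, but the configurable threshold is what lets administrators trade failover speed against false positives.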
The top two for SQL Server
For mission-critical SQL Server databases running in the cloud, there are really only two options capable of assuring the 99.99 per cent uptime required for high availability: SQL Server’s own Always On Availability Groups and third-party SANless failover clusters. Purpose-built failover clustering solutions afford these six advantages:
- Application-agnostic design provides a universal HA/DR solution
- Works with the less expensive SQL Standard Edition
- Permits distributed transactions and unlimited databases
- Protects the entire SQL Server instance
- Supports more than two nodes in a cluster
- Simplifies implementation and operation
It is true that these options increase both capital and operational expenditures. But weighing that relatively modest increase against the cost of downtime and the savings afforded by the cloud should make a compelling case for implementing or improving IT resiliency for your SQL Server databases.
David Bermingham, Technical Evangelist, SIOS Technology