
Strengthening the availability chain

(Image credit: Everything Possible / Shutterstock)

What comes to mind first when you think about ensuring the high availability (HA) of your most important applications and data? If you or your customers need to access those applications 99.99 percent of the time, it’s natural to think first about ensuring access to the compute and storage resources. If you’re running SQL Server in the cloud, for example, you can configure a SQL Server Failover Cluster Instance (FCI) on a Windows Server failover cluster so that the failure of compute or storage resources automatically moves the compute and storage loads to an alternate node of the cluster. HA problem solved!

But what if it’s not the compute or storage resources that fail? There are many links in the availability chain connecting you and your customers to those compute and storage resources. You need to consider all those links to ensure the HA experience you are striving to achieve. 

Network availability

If you’re running your critical applications in the cloud, your cloud service provider is going to ensure the availability of the intranet connecting the components of your cloud infrastructure. AWS, Azure, and Google Cloud Platform all provide high speed, robust internal networks with multiple paths, so the core cloud networks are fully capable of supporting your 99.99 percent HA goal. 

You can’t control how your customers connect to your cloud-based applications, but you can control how you connect to them. You might be using a VPN gateway or a dedicated connectivity service such as Azure ExpressRoute, AWS Direct Connect, or Google Cloud Dedicated Interconnect. All these options can provide you with a high-speed, low-latency connection to the cloud, but they all offer different SLAs, and several of them expose weak links in the availability chain. The basic configuration of Azure ExpressRoute offers only a 99.95 percent availability guarantee; the basic configuration of AWS Direct Connect is even lower, at only 99.9 percent. If either service fails unexpectedly, access to your critical applications could be constrained for far longer than you expect. Indeed, the VMs configured for HA in the Azure or AWS clouds may continue to run without interruption, but that’s cold comfort if you cannot access them because ExpressRoute or Direct Connect is down.
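To make those SLA percentages concrete, it helps to convert them into a yearly downtime allowance. The short Python sketch below does that for the three figures mentioned above. The function name is illustrative, and the calculation assumes a 365-day year; providers typically measure SLAs monthly, so treat the output as a rough guide rather than a contractual figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the downtime, in minutes per year, that an SLA percentage permits."""
    return MINUTES_PER_YEAR * (100.0 - sla_percent) / 100.0


# The three availability levels discussed in the text.
for sla in (99.99, 99.95, 99.9):
    print(f"{sla}% -> about {allowed_downtime_minutes(sla):.0f} minutes per year")
```

Under these assumptions, 99.99 percent allows about 53 minutes of downtime per year, 99.95 percent about 263 minutes (roughly 4.4 hours), and 99.9 percent about 526 minutes (roughly 8.8 hours).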

You can configure Azure ExpressRoute or AWS Direct Connect for HA; it just takes planning. To gain an SLA of 99.99 percent, you’ll need to configure at least two ExpressRoute circuits on Azure or at least four Direct Connect connections on AWS. If you’re using the analogous service on GCP, you’ll want to use the Dedicated Interconnect topology for production-level applications rather than the topology for non-critical applications to get the 99.99 percent SLA.
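The reason redundant circuits raise availability can be sketched with a simple probability model. The Python snippet below is an illustration only: it assumes the redundant paths fail independently, which real circuits sharing facilities or upstream dependencies may not, and the function name is hypothetical. The 99.99 percent figures the providers publish are contractual commitments, not the output of this formula.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant paths, assuming independent failures.

    The service is unavailable only when every path is down at once,
    so the combined unavailability is (1 - single) ** copies.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


# Two paths at 99.95% each, and four paths at 99.9% each.
print(f"{parallel_availability(0.9995, 2):.6%}")
print(f"{parallel_availability(0.999, 4):.6%}")
```

In this idealized model, the theoretical availability comfortably exceeds 99.99 percent; the more conservative SLA reflects the fact that failures in practice are not fully independent.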

Infrastructure availability

Even if you strengthen the weak links in the network, though, there remain potential weak links within the cloud infrastructure itself—among load balancers, DNS servers, identity and authentication servers, web server farms, and the like. Remember the very public outage at Facebook in October of 2021? Outages affecting access to Facebook’s internal DNS servers—not the production systems supporting Facebook’s primary lines of business—were responsible for bringing down the entire organization for hours. You need to look at these components of your overall infrastructure as well to ensure that you’re fully configured for HA.

Google’s SLA for DNS server services is 100 percent, which is encouraging, but its SLA for Cloud Identity services is only 99.9 percent. Similarly, AWS’s Route 53 private DNS service strives to offer a 100 percent SLA, but its Directory Services offering tops out at 99.9 percent. The Azure Active Directory Basic and Premium Services offer a 100 percent SLA, but the SLA for Azure Active Directory Domain Services tops out at 99.9 percent.

As with network connectivity, there are things one can do to improve the reliability of the internal infrastructure supporting your critical cloud-based applications. For example, you can configure your AWS environment with multiple domain controllers, which can boost the reliability of the AWS Directory Services offering closer to the 99.99 percent accessibility levels you seek.
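The "chain" metaphor has a simple arithmetic counterpart: when every link must be up for the application to be reachable, the end-to-end availability is roughly the product of the individual availabilities. The sketch below uses SLA figures quoted in this article as sample inputs; the link names and the function are illustrative, and the model assumes the links fail independently.

```python
from math import prod


def chain_availability(links: dict[str, float]) -> float:
    """End-to-end availability when every link in the chain must be up."""
    return prod(links.values())


# Sample inputs taken from figures mentioned in the article (assumed independent).
links = {
    "compute and storage (HA target)": 0.9999,
    "dedicated connection (basic SLA)": 0.9995,
    "directory service": 0.999,
}

overall = chain_availability(links)
print(f"End-to-end availability: {overall:.4%}")
```

With these inputs the chain comes out at roughly 99.84 percent, well short of the 99.99 percent goal even though the compute and storage layer meets it on its own.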

The multi-cloud option

There are times, though, as in the seven-hour AWS outage of December 7, 2021, when even the most prepared organizations may encounter unexpected downtime. In the case of the AWS outage, the issues stemmed not from systems that customers were using but, as AWS notes, from errors occurring on an internal network designed “to host foundational services, including monitoring, internal DNS, authorization services, and parts of the EC2 control plane.” Indeed, in many cases the VMs on which customer applications were running remained operational and fully compliant with HA SLAs. Yet customers could not access their applications because of issues with gateways, internal DNS services, load balancers, and other components whose ability to operate properly was compromised by the cascading effects of the errors occurring on the internal network.

How can your applications remain operational and accessible when the weak link in the availability chain turns out to be the cloud itself? Your best option here is to rely on a multi-cloud disaster recovery (DR) solution. Essentially, you would create a mirror infrastructure to support your most vital applications in an entirely separate cloud. If your critical SQL Server infrastructure runs on AWS, for example, you would create an identical instance of SQL Server on Azure or GCP, an instance you could start up manually if the AWS cloud went offline. You will want to select a DR management solution that runs in both the AWS and Azure/GCP environments and that can automatically orchestrate the replication of data from the SQL Server instance in AWS to storage attached to the infrastructure in your Azure/GCP cloud environment. If you don’t deploy the same DR management solution in both environments, you may not replicate your data properly between the clouds.

You’ll also want to configure a high-speed virtual private network (VPN) connection between your primary and DR infrastructures. AWS, Azure, and GCP all offer VPN services that can enable a secure cloud-to-cloud connection (and there are third-party options as well), and this connection becomes the conduit through which your DR management solution replicates your critical data between the cloud infrastructures. Yes, if you were using an AWS VPN solution in December 2021, it might have gone offline during the outage, but in this case that’s okay. The DR management solution running on AWS replicates all the local write operations to its storage counterpart in the DR infrastructure as quickly as the network will allow, so by the time the AWS services went offline the DR software would have replicated all (or nearly all) of the critical AWS data to the DR infrastructure. As soon as it was apparent that the primary cloud had gone offline, you would spin up the infrastructure in the DR cloud and it could begin providing customer access to your critical applications with minimal disruption. You may not be up and running in the sub-five-minute timeframe you expect of an HA solution, but you would be operational far faster than if you had to wait seven hours for AWS to get its operations back online.
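Deciding when it is "apparent that the primary cloud had gone offline" usually comes down to a rule applied to health-check results. The following Python sketch shows one simple, hypothetical rule: recommend failover only after several consecutive failed probes, so that a single blip does not trigger a move to the DR cloud. The function names, the threshold of three, and the boolean probe history are all assumptions for illustration; this does not represent any particular DR product.

```python
from typing import Sequence


def consecutive_failures(probe_results: Sequence[bool]) -> int:
    """Count failed probes at the end of the history (most recent result last)."""
    count = 0
    for ok in reversed(probe_results):
        if ok:
            break
        count += 1
    return count


def should_fail_over(probe_results: Sequence[bool], threshold: int = 3) -> bool:
    """Recommend failover once the latest `threshold` probes have all failed."""
    return consecutive_failures(probe_results) >= threshold


# True means the primary site answered the probe; False means it did not.
history = [True, True, False, False, False]
if should_fail_over(history):
    print("Primary looks unreachable: start the DR infrastructure.")
```

A rule like this only advises; in the scenario described above, an operator would still start the DR infrastructure manually once the recommendation appears.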

Application availability

Ultimately, configuring for HA is all about configuring to ensure the high availability of your application. You can create FCIs that will ensure the HA of your VMs and storage without difficulty. All cloud service providers are accustomed to accommodating you at that level. For true end-to-end HA, though, you need to pay extra attention to all the other links in the availability chain. Some will be weaker than you realize unless you take extra steps to strengthen them. 

Dave Bermingham, Senior Technical Evangelist, SIOS Technology

Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He holds numerous technical certifications and has been recognized as a Microsoft MVP for both Clusters and Cloud & Datacenter Management.