Start with a simple premise: No system is wholly invulnerable. Things happen – from hurricanes and power failures to a million things we haven’t imagined. While it’s important to consider what could bring down your production SQL Server database, it’s even more important to ensure that you can recover your SQL Server database in a timely and efficient manner. Expecting the unexpected is prudent, but it’s even better to know in advance that you have prepared to respond in the manner best suited to the needs of your organisation.
Whether you are running your SQL Server systems on-premises, in the cloud, or in a hybrid configuration that spans both, a range of options exist to enable you to recover your SQL Server database and get back online. But with each option there are trade-offs – in recovery time, in potential for data loss, in operational cost, and in operational complexity. With better insight into the trade-offs, you’ll be in a better position to make an informed decision now. And that’s important because you need to make that decision in advance of the day when that decision will really matter.
Four approaches to SQL Server disaster recovery
Disaster recovery (DR) is about exactly that: recovering from a blow that has taken your primary SQL Server production system offline. We’re talking about something much more consequential than a momentary glitch or an unexpected server reboot; think of a scenario in which the physical server running your primary SQL Server instance suffers a catastrophic failure, or your data centre loses power for an extended period due to a fire or flood.
In such circumstances, your ability to recover your production SQL Server system and make it operational somewhere else – likely in another data centre – will depend on the steps you’ve taken to ensure the readiness of backup systems. You need three things: backup servers on which to run an instance of SQL Server; current copies of the primary databases with which those servers can interact; and a mechanism that can both detect that the primary SQL Server instance has gone offline and trigger a process by which the backup servers are brought online and identified as the new primary servers supporting your production SQL Server system.
There are four key approaches to meeting these needs, but each delivers results in different ways and with different trade-offs. Before we dive into those details, though, it’s useful to start off with what these approaches have in common.
- All require you to maintain backup servers in at least one other geographically separated location. Having two backup servers (in two distinct locations) is even better, in case one of the backup servers also fails. These could reside in other data centres you own, if you’re operating on-premises, or they could reside in remote cloud data centres.
- All approaches require you to maintain versions of SQL Server on your backup servers. Here there are cost and licensing considerations that you may need to keep in mind, depending on how you have acquired your licenses for SQL Server.
While the need to maintain backup servers and SQL Server instances is common to all four disaster recovery approaches, these requirements play out differently in each, as we’ll see in turn.
Two other concepts should inform your decision: determining which DR approach is best for your organisation will ultimately require you to have determined your recovery time objective (RTO) and recovery point objective (RPO). The RTO is relatively easy to explain: how quickly must your SQL Server system be operational after a disaster? The RPO is a little more difficult to quantify: how much data are you willing to lose during the process of bringing a new instance of SQL Server online in a remote data centre? Naturally, no one wants to lose any data, so understanding the degree to which you may lose data during recovery is important.
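A back-of-the-envelope estimate can make RTO and RPO concrete. The sketch below is a hypothetical illustration: the function names and all figures are invented examples, not measurements from any real deployment.

```python
# Toy RTO/RPO estimate for a backup-and-restore style DR plan.
# All figures are hypothetical examples, not measurements.

def worst_case_rpo_minutes(log_backup_interval_min, copy_delay_min):
    """Data at risk: changes made since the last log backup that reached the DR site."""
    return log_backup_interval_min + copy_delay_min

def estimated_rto_minutes(restore_base_min, logs_to_apply, apply_min_per_log, switchover_min):
    """Time to restore the base backup, roll logs forward, and redirect users."""
    return restore_base_min + logs_to_apply * apply_min_per_log + switchover_min

# Example: log backups every 15 minutes, ~5 minutes to ship each one off-site.
print(worst_case_rpo_minutes(15, 5))             # 20 minutes of transactions at risk
# Example: 60-minute base restore, 24 log backups at 2 minutes each, 10 minutes to repoint users.
print(estimated_rto_minutes(60, 24, 2, 10))      # 118 minutes until back online
```

Working the numbers this way before a disaster makes it obvious whether a given approach can ever meet the objectives you've set.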
Log shipping

Stated simply, log shipping is a process through which the log files recording every transaction taking place in a SQL Server database are saved to the remote sites where backup servers (physical or virtual) and instances of SQL Server are located. The log files don’t contain the data in the database; they simply present a record of all the changes made to that database. They can act as a kind of transaction map that can be used to take an earlier instance of a database (last night’s backup copy, for example) and roll it forward to the point reflected in the last log record.
There are specific advantages to log shipping. It’s inexpensive, as SQL Server maintains the log files on its own. It’s a useful approach if you’re bandwidth-constrained because you don’t need to maintain identical copies of your databases in multiple locations in real-time. You can use the log files to roll forward last night’s backup, even if there are multiple hours’ worth of more recent transactions that are not in that backup copy.
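The "transaction map" idea can be sketched with a toy model: start from last night's snapshot and replay the logged changes, oldest first, to reconstruct the current state. This is a deliberate simplification and bears no resemblance to SQL Server's actual log format.

```python
# Toy illustration of rolling a stale snapshot forward from a change log.
# The dict-based "database" and log records are simplified stand-ins.

def roll_forward(snapshot, log_records):
    """Apply logged changes, oldest first, to reconstruct the current state."""
    db = dict(snapshot)  # start from the backup copy; don't mutate the original
    for op, key, value in log_records:
        if op == "set":
            db[key] = value
        elif op == "delete":
            db.pop(key, None)
    return db

last_night = {"account:1": 100, "account:2": 250}
todays_log = [
    ("set", "account:1", 80),      # withdrawal recorded this morning
    ("set", "account:3", 500),     # new account opened
    ("delete", "account:2", None), # account closed
]
print(roll_forward(last_night, todays_log))
# {'account:1': 80, 'account:3': 500}
```

Note that the replay cost grows with the number of logged changes, which is exactly why roll-forward time dominates the recovery clock in a log shipping scenario.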
But that points to the trade-offs associated with log shipping. Depending on the volume of transactions made to your SQL Server databases since the backup copy you’re restoring was created, it can take a significant amount of time to roll one or more databases forward from the log files. And while you are rolling your backup databases forward, your production SQL Server environment remains offline. Log shipping can also be a complicated manual process because there’s no automated failover to the backup environment in a log shipping scenario: you need to manually reconfigure your production environment to point users to the backup systems running SQL Server. Failing back to your primary server can be tricky, too, because you need to reconfigure log shipping in reverse and reseed your data back to your original database configuration.
For all these reasons, if your RTO calls for your production environment to be back online in less than a few hours, log shipping may not meet your organisation’s needs.
Failover cluster instances

The failover cluster approach involves configuring multiple servers in multiple locations and typically involves a shared storage area network (SAN) on which one or more SQL Server databases reside. This collection of servers and storage is known as a failover cluster instance (FCI).
One thing that distinguishes an FCI is its use of Windows Server Failover Clustering (WSFC), which is built into Windows Server and provides services that monitor the health of the FCI and automate failover of the SQL Server production environment to one of the remote backup servers if it detects that the primary SQL Server instance has failed.
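At its simplest, the kind of health monitoring that triggers automated failover can be sketched as a heartbeat check. Real WSFC health detection and quorum arbitration are far more sophisticated; this is only a toy model with hypothetical thresholds.

```python
# Toy heartbeat-based failure detection, loosely in the spirit of cluster
# health monitoring. Real WSFC behaviour is far more sophisticated.

def should_fail_over(last_heartbeat_tick, now_tick, missed_threshold):
    """Declare the primary dead once enough consecutive heartbeats are missed."""
    return (now_tick - last_heartbeat_tick) >= missed_threshold

# Primary last seen at tick 100; tolerate up to two missed heartbeats.
print(should_fail_over(last_heartbeat_tick=100, now_tick=102, missed_threshold=3))  # False
print(should_fail_over(last_heartbeat_tick=100, now_tick=103, missed_threshold=3))  # True
```

The threshold is a tuning decision: too low and transient network blips cause spurious failovers; too high and the RTO clock runs longer than it needs to.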
The use of FCIs has been the backbone of DR for on-premises deployments of SQL Server for years, and it offers two significant benefits. First, the automation provided by WSFC ensures that failover takes place quickly and without the extraordinary manual intervention encountered in log shipping. Second, the use of a shared SAN enables any backup server in the FCI to begin interacting with the SQL Server databases as soon as the automated failover process is complete, at which point users can interact with the newly promoted production servers. There’s no time lost rolling an older instance of a database forward, as in log shipping, because the backup servers in the FCI access the same database on the SAN that the primary server had been using before it went offline. In terms of RPO, you effectively lose nothing.
The trade-offs? Failover clustering was originally designed for on-premises deployment, which means that it requires an investment in hardware, network, and human resources. Moreover, if you want to use the cloud, you can’t create the kind of FCI with shared storage as I’ve just described because you can’t share storage in the same way in the cloud. There are options for SANless clustering, as you’ll see in the next two sections.
Availability groups

The availability group (AG) approach builds on failover clustering but takes out the SAN component, which makes it a viable alternative to classic failover clustering if you want to run SQL Server in the cloud. AGs take advantage of Windows Server Failover Clustering to monitor the health of the servers and automate failover in the event of a disaster. However, instead of sharing storage, AGs rely on local storage attached to each physical or virtual server. To ensure access to the SQL Server databases themselves, AGs provide services that can replicate the databases (synchronously or asynchronously) from the primary production environment to one or more secondary/backup servers. If the primary instance of SQL Server goes offline, the AG can automate failover to a secondary server, which can begin meeting your users’ needs with its local copy of the databases from the primary server.
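The choice between synchronous and asynchronous replication drives your RPO, and the difference can be sketched in a toy model. The function below is a hypothetical illustration, not how AG commit hardening is actually implemented.

```python
# Toy contrast of synchronous vs asynchronous replication at failover time.
# Real Availability Group internals are far richer; this shows only the RPO effect.

def surviving_commits(primary_commits, synchronous, replication_lag):
    """Commits present on the secondary at the moment the primary fails.

    Synchronous commit: the primary doesn't acknowledge a commit until the
    secondary has hardened it, so no acknowledged work is ever lost.
    Asynchronous commit: the secondary can trail by `replication_lag` commits.
    """
    if synchronous:
        return list(primary_commits)
    return primary_commits[:len(primary_commits) - replication_lag]

commits = ["t1", "t2", "t3", "t4", "t5"]
print(surviving_commits(commits, synchronous=True, replication_lag=2))
# ['t1', 't2', 't3', 't4', 't5']  -- nothing lost
print(surviving_commits(commits, synchronous=False, replication_lag=2))
# ['t1', 't2', 't3']  -- t4 and t5 lost
```

The flip side, not modelled here, is that synchronous commit adds replica round-trip latency to every transaction, which is why asynchronous mode is often chosen for geographically distant secondaries.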
But, yes, there are trade-offs. You can configure an AG to run on virtual machines (VMs) in the AWS or Azure cloud, so you can avoid the large investments in infrastructure associated with FCIs. And, if you have only a handful of databases and one backup server, you can run a Basic AG using the Standard edition of SQL Server 2016 or later, which can further help reduce costs. However, if you have more than a few databases (which is most commonly the case) or you want to involve more than one backup server (always a good idea for disaster recovery), you’ll need to use the Always On AG feature that comes with the significantly more costly Enterprise edition of SQL Server 2012 or later.
There’s another trade-off inherent in the AG approach: AGs replicate only the user-defined SQL Server databases. The system databases, including master (which stores logins and passwords) and msdb (which stores SQL Server Agent jobs), are not replicated to the secondary VMs. Not having current copies of these databases could create an RPO problem if you continue to use the backup servers as your primary production instance of SQL Server.
SANless failover clustering

A fourth approach to DR involves the creation of a SANless FCI – on-premises, in the cloud, or in a hybrid of both – in which the entire set of SQL Server databases is replicated to all the backup servers that might be called into service in the event of a disaster. If the primary SQL Server instance fails, the cluster fails over to a secondary server, which can immediately begin working with its copy of the primary SQL Server databases. Unlike the AG approach, though, the replication mechanism in a SANless FCI performs block-level replication of whatever resides in the storage system, ensuring that all secondary servers have complete copies of all the primary databases.
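The scope difference between database-level and block-level replication can be sketched in a toy model: block-level replication copies whatever occupies the storage volume, so the system databases travel with the user data. This is a purely hypothetical illustration; no real replication product works this way.

```python
# Toy contrast: database-level replication copies only selected user databases,
# while block-level replication mirrors the whole volume's contents.
# Purely illustrative; names and contents are invented.

volume = {
    "UserDB_Sales": "sales pages",
    "UserDB_HR":    "hr pages",
    "master":       "logins and passwords",
    "msdb":         "agent jobs",
}

def database_level_replicate(volume, selected_dbs):
    """AG-style: only the chosen user databases reach the secondary."""
    return {name: data for name, data in volume.items() if name in selected_dbs}

def block_level_replicate(volume):
    """SANless-cluster-style: every block on the volume is mirrored."""
    return dict(volume)

secondary_ag = database_level_replicate(volume, {"UserDB_Sales", "UserDB_HR"})
secondary_sanless = block_level_replicate(volume)
print("master" in secondary_ag)       # False: logins are missing on the secondary
print("master" in secondary_sanless)  # True: system databases come along
```

This is the mechanical reason a SANless FCI avoids the system-database gap described for AGs above: replication operates below the level at which SQL Server distinguishes one database from another.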
The benefits of a SANless cluster approach to replication are many. Unlike the AG approach, you can use any edition of SQL Server or Windows Server, and you can replicate all your databases; there’s no need for costly upgrades to SQL Server Enterprise edition. Unlike the classic FCI approach, you can deploy a SANless cluster in the cloud, on-premises, or both, and you can use it to meet both your RTO and your RPO. You can also use high-performance SSDs for local storage, gaining faster performance at a fraction of the cost of a SAN.
The trade-offs? You’ll need to license a third-party product that integrates with WSFC to create a SANless cluster. That said, creating and managing SANless clusters is easier than learning to reconfigure your production environment on the fly (as you would in a log shipping scenario), and the SANless cluster management tools work with WSFC to facilitate cluster management. This integration also ensures that the SANless cluster can benefit from the automation services WSFC provides, so you can fail over rapidly to a backup server in a disaster. When that happens, the backup server can begin working right away with its local copy of the SQL Server databases.
Dave Bermingham, Senior Technical Evangelist, SIOS Technology