Even brief periods of downtime for SAP applications can have devastating consequences for business operations, including production and supply chain interruption, delivery delays, lost sales, data loss, payroll delays, customer service degradation, contractual liabilities and regulatory fines.
Unfortunately, SAP offers little or no native support for the server clustering, data replication and failover recovery functions needed for assuring high availability and data protection. It is equally unfortunate that most IT organizations also lack the staffing needed to even consider using open source high availability software in a Linux environment.
Fortunately, the growing popularity of SAP and Linux has given rise to a variety of viable alternatives for keeping SAP applications highly available and operating properly through the usual hardware and software failures, administrator errors, routine maintenance, and even site-wide disasters. The challenge is determining which one offers the best solution.
In this article I outline five best practices that together make it possible to fully protect SAP applications in a Linux environment using an integrated approach that is dependable, cost-effective, and simple to implement and operate.
1. Simplify failover clustering to get the best results
For many IT departments, creating a cluster to protect SAP, along with other applications and data in a Linux environment, has proven to be a complex, time-consuming and error-prone process. The complexity leads to a high degree of uncertainty about the ability of the applications to fail over when necessary and to fully protect all data under all possible circumstances.
The reason for the complexity is that unlike Windows, with its carrier-class high-availability features, the Linux operating system lacks built-in HA provisions. For example, Windows Server Failover Clustering makes it simple to implement, test, monitor and manage HA and disaster recovery provisions for all applications. With Linux, by contrast, HA and DR remain mostly DIY endeavors that require assembling open source software like Corosync Cluster Engine and Pacemaker, which the IT team must then integrate and maintain itself. Getting the full software stack to work as desired requires creating (and testing) custom scripts for each application, and then retesting and updating each script after even minor changes are made to any of the software or hardware being used. The result, too often, is failover provisions that fail when they are needed.
The easiest way to implement HA for Linux is to use a commercial solution, and these can be either storage-based or host-based. Storage-based HA solutions, while once popular, have fallen out of favor owing to their high cost and inherent limitations. These solutions protect data by replicating it within a redundant and resilient storage area network (SAN). But the approach requires the entire SAN infrastructure to be acquired from a single vendor and relies on separate failover provisions to deliver high availability at the application level.
Host-based HA solutions have grown in popularity based on their ability to deliver mission-critical high availability cost-effectively. These solutions create a storage-agnostic SAN-based or SANless failover cluster across Linux server instances. As an HA overlay, the resulting clusters are capable of operating across both the LAN and WAN in private, public and hybrid clouds. While this approach does consume host resources, these are relatively inexpensive and simple to scale in a Linux environment. Shared-nothing SANless clusters are preferred because they eliminate all potential single points of failure.
Most HA SANless failover cluster solutions provide a combination of real-time block-level data replication, continuous application monitoring, and configurable failover/failback recovery policies. Some of the more robust solutions also offer advanced capabilities like wizard-driven ease of configuration and operation, a choice of synchronous or asynchronous replication, WAN optimization to maximize performance, manual switchover of primary and secondary server assignments for planned maintenance, and the ability to perform routine backups without disruption to the application.
2. Ensure the cluster protects the entire SAP environment
Most SANless failover clustering solutions are application-agnostic; that is, they are capable of supporting virtually any application. Some also offer additional capabilities that are unique to specific applications, including SAP.
For example, basic failover clustering software provides simple monitoring to ensure that the primary server, but not necessarily the SAP application, is alive. This rudimentary monitoring does not protect the SAP application from many problems that can cause service disruption and data loss. In addition, while the SAP application itself might fail over to a standby server, the full application stack, including critical databases and other key operations, might not.
This is why it is important to choose failover clustering software that monitors and protects the entire SAP environment. In addition to monitoring the server, the solution should verify that the SAP application is running, the file shares and databases are mounted and available, and the clients are able to connect. Complete support for SAP requires actively monitoring server hardware and OS software, the SAP Primary Application Server (PAS) Instance, the ABAP SAP Central Services (ASCS) Instance, all back-end databases (Oracle, DB2, MaxDB, MySQL and PostgreSQL), the SAP Central Services Instance (SCS), all file systems, volumes, shares and NFS mounts, the IP and virtual IP addresses, the enqueue and message servers, and the logical volumes. Ideally, the solution would be SAP-certified to assure the best available protection.
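As a rough illustration of what monitoring the entire environment means, the following Python sketch runs one check per component and treats the environment as healthy only when every check passes. The component names and check functions here are hypothetical placeholders, not the interface of any particular clustering product; real checks would probe the SAP instances, databases, mounts and virtual IP addresses listed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    component: str
    healthy: bool


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check; a check that raises is recorded as unhealthy."""
    results = []
    for component, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        results.append(CheckResult(component, healthy))
    return results


def failed_components(results: List[CheckResult]) -> List[str]:
    """Names of the components whose check did not pass."""
    return [r.component for r in results if not r.healthy]


def environment_healthy(results: List[CheckResult]) -> bool:
    """The environment is healthy only if no component failed."""
    return not failed_components(results)


def _unreachable_database() -> bool:
    # Hypothetical stand-in for a database probe that cannot connect.
    raise ConnectionError("database did not respond")


# Hypothetical checks: the server is alive, but the database is not.
example_checks = {
    "server": lambda: True,
    "ascs_instance": lambda: True,
    "database": _unreachable_database,
    "nfs_mount": lambda: True,
    "virtual_ip": lambda: True,
}

example_results = run_checks(example_checks)
```

In this example a server-only heartbeat would report success, while the aggregated result flags the database and marks the environment as unhealthy.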
3. Only fail over when absolutely necessary
Initiating a full failover of the SAP application can often be avoided with faster, more appropriate lightweight techniques when available, such as restarting the application on the primary server. Failover clustering software capable of handling various failure scenarios intelligently should make it possible to configure the desired responses in HA policies.
For example, some solutions are able to stop and restart applications both locally and remotely on another cluster at either the same or another site. When a problem is detected, the system can automatically take one of three configurable recovery actions: attempt a restart on the same server; fail over to a standby server; or merely alert a system administrator. Combinations of these options may also be possible, such as first attempting a restart while notifying the administrator, and then switching over to the standby server if the restart fails.
In addition to protecting against unplanned downtime, the failover cluster should eliminate the need for any downtime during routine maintenance tasks. Indeed, no planned maintenance, including backups, software updates or hardware upgrades, should ever bring down a mission-critical application. This usually requires manual failover and failback among primary and standby servers. In failover cluster configurations with three or more servers, high availability is preserved even while one server is offline for maintenance.
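The escalation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any specific product implements its HA policies: the `Action` names, the `recover` function and the callback signatures are all hypothetical.

```python
from enum import Enum
from typing import Callable, List, Optional


class Action(Enum):
    RESTART_LOCAL = "restart on the same server"
    FAILOVER = "fail over to a standby server"
    ALERT_ONLY = "alert an administrator"


def recover(policy: List[Action],
            restart: Callable[[], bool],
            failover: Callable[[], bool],
            alert: Callable[[str], None]) -> Optional[Action]:
    """Try each configured action in order; return the one that resolved
    the incident, or None if every automatic action failed."""
    handlers = {Action.RESTART_LOCAL: restart, Action.FAILOVER: failover}
    for action in policy:
        if action is Action.ALERT_ONLY:
            alert("problem detected; manual intervention requested")
            return action
        # Notify the administrator before each automatic attempt.
        alert(f"attempting to {action.value}")
        if handlers[action]():
            return action
    alert("all automatic recovery actions failed")
    return None


# Hypothetical incident: the local restart fails, the failover succeeds.
messages: List[str] = []
outcome = recover(
    policy=[Action.RESTART_LOCAL, Action.FAILOVER],
    restart=lambda: False,
    failover=lambda: True,
    alert=messages.append,
)
```

Keeping the policy as an ordered list makes the lightweight action the default and a full failover the fallback, which matches the principle of failing over only when necessary.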
4. Fully leverage all available resources
In today’s IT infrastructures, with physical and virtual server, storage and networking resources configured in private, public and hybrid clouds, the failover clustering solution should be able to take advantage of anything and everything that makes HA both more dependable and affordable.
Although I have been discussing the use of SANless failover clusters as a best practice, there may be some applications for which use of a SAN is more cost-effective. The problem with some SANs is that they can create a single point of failure for shared data, and for mission-critical HA, any single point of failure is unacceptable. But for applications that use only relatively static data protected with acceptable recovery point and time provisions, shared SAN or network-attached storage resources might be viable. Supporting these more cost-effective configurations requires the SANless failover clustering solution to also support SANs and NAS.
5. Make HA and DR virtually (or physically) fool-proof
Single points of failure in compute and storage resources might exist within a single site, and the site itself is one; a multi-site configuration removes them. Protecting SAP environments from a site-wide disaster, therefore, requires a cluster that can fail over to a remote location, including one provided in a public cloud. The best configurations have clusters with multiple standby physical and/or virtual server instances at multiple sites. As I mentioned above, a configuration with at least three server instances and three sets of replicated data is able to maintain high availability even during periods of planned maintenance when one of the servers must be taken offline.
Such triple redundancy also makes it easier to test HA and DR configurations. In fact, testing failover provisions in a configuration with only two server instances could result in downtime for the application if the tests were to fail in certain ways. A capable SANless failover clustering solution should be able to perform tests on standby server instances without ever adversely affecting the primary server instance. And rotating the primary among the server instances enables all to be tested to ensure the failover provisions will not fail when actually needed.
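The redundancy argument in this section comes down to simple counting, which the following Python sketch makes explicit. The node names, the state labels and the rule that high availability requires an online primary plus at least one online standby are assumptions made for illustration.

```python
from typing import Dict


def remaining_standbys(node_states: Dict[str, str], primary: str) -> int:
    """Count the online nodes other than the primary."""
    return sum(
        1 for name, state in node_states.items()
        if name != primary and state == "online"
    )


def is_highly_available(node_states: Dict[str, str], primary: str) -> bool:
    """Assumed rule: the primary is online and at least one standby
    is ready to take over."""
    return (node_states.get(primary) == "online"
            and remaining_standbys(node_states, primary) >= 1)


# Two nodes: taking the standby offline leaves nothing to fail over to.
two_nodes = {"node-a": "online", "node-b": "maintenance"}

# Three nodes: one standby is in maintenance, another is still ready.
three_nodes = {"node-a": "online", "node-b": "maintenance", "node-c": "online"}

two_node_ha = is_highly_available(two_nodes, primary="node-a")      # False
three_node_ha = is_highly_available(three_nodes, primary="node-a")  # True
```

The same check applies to testing: with three nodes, a standby can be exercised or taken offline while another standby continues to protect the primary.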
Yes, some of these best practices, especially the triple redundancy, will increase the cost of protecting SAP applications. But for mission-critical applications, the cost of high availability is usually eclipsed by the cost of downtime or data loss. And this is why some of these best practices are intended to help minimize costs. For example, simplifying the approach to clustering keeps administrative costs low, while the ability to take full advantage of all available resources enables incorporating the most cost-effective ones in various configurations.
The bottom line of best practices for high availability and disaster recovery should be a better bottom line for the business, all things considered. And no other HA solution for Linux makes more of these best practices available than the SANless failover cluster does.
Jonathan Meltzer, Director, Product Management at SIOS Technology