LifeKeeper

Sean Tierney

Issue #132, April 2005

Having experimented with high-availability, open-source solutions and having used other commercial packages, I found LifeKeeper for Linux version 4.4.3 to be a good product.

LifeKeeper for Linux is a high-availability clustering software package from Steeleye Technology, Inc. Steeleye acquired LifeKeeper when NCR spun off the technology, originally developed by AT&T Bell Labs. Steeleye ported LifeKeeper to Linux as well as to other operating systems. Version 4.4.3 supports failover for communications resources, databases, filesystems and mail, print and Web servers.

Steeleye refers to the type of high availability provided by LifeKeeper as fault resilience, the ability to recover from a failure automatically. This is differentiated from the idea of fault tolerance, where the system continues to operate after a failure occurs.

LifeKeeper is supported on various Linux distributions, including Red Hat, SuSE, UnitedLinux and Miracle Linux. The minimum system requirements for LifeKeeper are a supported Linux distribution running on an Intel-based server, 64MB of RAM and approximately 10MB of local disk space. Data protection is achieved by using either shared storage with SCSI or Fibre Channel or non-shared storage using LifeKeeper Data Replication.

The LifeKeeper software contains of a set of core applications and is extended by application-specific recovery kits (ARKs). The installation support and core applications package installed the software base. This included binaries and configuration files for the graphical and command-line interfaces, recovery support for the operating system, filesystems, SCSI subsystem, processor, memory, IP address and raw I/O. It also included an on-line help system and man pages. Application recovery kits are available for Apache Web server, data replication, IBM DB2, Informix, Logical Volume Manager, MySQL, NAS, NFS, Oracle, PostgreSQL, print services, SAMBA, SAP and Sendmail.

The software is licensed per server and per recovery kit. A cluster of two servers requires two licenses for the core application and two additional licenses for each of the application recovery kits. For instance, to protect a pair of LAMP Web application servers, licenses are required for the core application, plus Apache and MySQL application recovery kits. Although licensing costs can mount up quickly, it does allow you to pay for only what you need.

I began my review of LifeKeeper for Linux by reading the product documentation, taking the on-line tutorial and attending a Web-based seminar. This is a well-documented product. The CD-ROMs I received from Steeleye contained a planning and installation manual, a configuration guide and manuals for each of the application recovery kits. The documentation was available on the Web as well as in PDF format. The on-line tutorial was fairly basic and covered the same information as the manuals.

The seminar consisted of a marketing presentation and a live demonstration of LifeKeeper. I felt that the presentation and demonstration would be useful to anyone starting to look into the product. If you're looking to introduce LifeKeeper into your business, it may be useful to have managers and coworkers attend the seminar. The live question-and-answer session was the best part. I encourage anyone interested in the product to review the tutorial and on-line documentation and compile a list of questions to submit during the seminar.

Some flexibility exists in the cluster configuration, so it is a good idea to spend some time considering what hardware, applications and services you want to protect. As a minimum, you should consider server hardware, storage options, communications path, failover model, protected applications and services. Steeleye is positioning LifeKeeper as a commodity product. As such, it should support most reasonable server configurations. Nevertheless, they have certified some hardware and provide guidelines for verifying LifeKeeper with uncertified hardware. Certified hardware vendors include Dell, HP and IBM. In fact, you can include the LifeKeeper software when purchasing systems from them.

Multiple storage options are available to choose from. Shared storage consists of a SCSI or Fibre Channel array that is connected to both systems in the cluster. Data is located on the shared array. LifeKeeper's locking mechanism prevents the standby system from accessing the partition while the active system is in service. The data-replication option enables data stored on the local disks of one system to be mirrored to another system. The network-attached storage option facilitates the use of volumes mounted from an NFS server or NAS device. For instances in which the data is static, such as Web servers, there is an option to not share or replicate the data store.

A central concept of LifeKeeper, as with most high-availability solutions, is the system heartbeat. One server sends a signal to the other to determine system and application health. Heartbeat communication path options include serial port and LAN. It is a good idea to use multiple paths, such as serial and LAN or multiple LAN connections. The failover models include active/active, active/standby and N+1. In active/active configuration, each server in the cluster is providing its own set of applications and services. If one fails, the other takes over. Users may experience some degradation of services, because the remaining system is serving both sets of applications and services, although it does allow for maximum resource utilization.

Active/standby provides the best continuity of service after a failure. However, it requires a redundant system and the associated cost. In N+1 configuration, one standby system provides failover protection for multiple active systems. This configuration provides reasonable utilization of resources while minimizing cost. If multiple failures should occur, users still may experience some increase in response time. Alternately, other active servers could be configured to take over. As previously mentioned, LifeKeeper offers failover protection for a variety of system components, services and applications. More information and documentation is available for each of the application recovery kits on the Steeleye Web site.

The first test scenario was a pair of servers running Linux, Apache, MySQL and PHP, serving up several Web applications. The hardware configuration I used was a cluster of two servers with dual network cards. I connected one NIC (eth0) on each server to the LAN; the second NICs (eth1) were connected to each other using a crossover cable. I connected the serial ports (ttyS0) with a null modem cable. I installed and tested the operating system, applications and supporting software before installing LifeKeeper. This is the recommended procedure, although the software could be installed after LifeKeeper.

During my first pass at installing LifeKeeper, I was running a custom kernel. Consequently, the Data Replication and NFS Recovery Kits were not installed. However, the installation guide provides instructions for patching your kernel and modules as needed. Later, I rebuilt the system and used a default kernel. No glitches occurred while running the installation support setup, installing the core applications and recovery kits. I used the LifeKeeper GUI to set up the communication paths for the heartbeat and to protect the Web application. Command-line procedures are available as well. The manual has step-by-step instructions for each phase of the setup and configuration, but the process is fairly intuitive. I tried several other configurations, including shared storage and legacy systems.

Once the software was installed and configured and I had tested all of the protected applications to ensure they were working properly, I ran several failover tests. I used the GUI to failover manually from one server to another and back again. This is the procedure that would be used to take a protected system out of service for maintenance. The other failures I induced included killing and shutting down protected services, shutting down and removing cables from the network interfaces and heartbeat communication paths and shutting down and pulling the power cord from a protected system. Manually taking a system out of service produced the quickest change over. Failover due to one of the faults I induced, however, was not as prompt. Failover from the active to standby system was quick but not immediate. A system administrator who might be watching the systems closely or a user who happened to be accessing the application when a fault occurred would notice a momentary pause in service. Depending on the type of application or service provided, this may not be a problem. Overall, I found the performance for failover and restoration of services to be adequate and consistent across all of my tests.

Having experimented with high-availability, open-source solutions and having used other commercial packages, I found LifeKeeper for Linux version 4.4.3 to be a good product. It is well documented and the software is comparatively easy to install and configure. Application recovery kits are available for most situations. Additionally, a generic recovery kit and a software development kit are available for those few cases not covered. The technical support is knowledgeable and helpful, and the cost is reasonable. Anyone in the market for a high-availability solution definitely should consider this product.

Sean Tierney is a graduate student at the University of Washington and a systems programmer working with UNIX and LANs. When not obsessed with a new computer project, he enjoys spending time with his wife, son and dogs on their dandelion ranch south of Seattle. He welcomes your comments sent to reviews@prnkstr.com.