Taming the Beast

Jason Allen

Issue #191, March 2010

The right plan can mean the difference between a large-scale system administration nightmare and a good night's sleep for you and your sysadmin team.

As the appetite for raw computing power continues to grow, so do the challenges associated with managing large numbers of systems, both physical and virtual. Private industry, government and scientific research organizations are leveraging larger and larger Linux environments for everything from high-energy physics data analysis to cloud computing. Clusters containing hundreds or even thousands of systems are becoming commonplace. System administrators are finding that the old way of doing things no longer works when confronted with massive Linux deployments. We are forced to rethink common tasks because the tools and strategies that served us well in the past are now crushed by an army of penguins. As someone who has worked in scientific computing for the past nine years, I know that large-scale system administration can at times be a nightmarish endeavor, but for those brave enough to tame the monster, it can be a hugely rewarding and satisfying experience.

People often ask me, “How is your department able to manage so many machines with such a small number of sysadmins?” The answer is that my basic philosophy of large-scale system administration is “keep things simple”. Complexity is the enemy. It almost always means more system management overhead and more failures. It's fairly straightforward for a single experienced Linux sysadmin to manage a cluster of a thousand machines, as long as all of the systems are identical (or nearly so). Start throwing in one-off servers with custom partitioning or additional NICs, and things become more difficult, and the number of sysadmins required to keep everything running starts to climb.

A well-stocked arsenal of system administration tools and techniques is vital if you plan to manage a large Linux environment effectively. In the past, you probably would have been forced to roll your own large-scale system administration utilities. The good news is that, compared to five or six years ago, many open-source applications now make managing even large clusters relatively straightforward.

Monitoring

System administrators know that monitoring is essential. I think Linux sysadmins especially have a natural tendency to be concerned with every possible aspect of their systems. We love to watch the number of running processes, memory consumption and network throughput on all our machines, but in the world of large-scale system administration, this mindset can be a liability. This is especially true when it comes to alerting. The problem with alerting on every potential hiccup is that you'll either go insane from the constant flood of e-mail and pages, or even worse, you'll start ignoring the alerts. Neither of those situations is desirable. The solution? Configure your monitoring system to alert only on actionable conditions—things that cause an interruption in service. For every monitoring check you enable, ask yourself “What action must be taken if this check triggers an alert?” If the answer is “nothing”, it's probably better not to enable the check.

I think it's smart to differentiate monitoring further into critical and noncritical alerts. E-mail and pager alerts should be reserved for things that require immediate action—for example, important systems that aren't pingable, full filesystems, degraded RAIDs and so on. Noncritical things, like NIS timeouts, instead should be displayed on a Web page that can be viewed when you get back from lunch. Also consider writing checks that automatically correct whatever condition they are monitoring. Instead of your script sending you an e-mail when Apache dies, why not have it try restarting httpd automatically? If you go the auto-correcting “self-healing” route, I'd recommend logging whatever action your script takes so you can troubleshoot the failure later.
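
As a rough sketch of the self-healing idea, the following check restarts Apache when it has died and logs what it did. It assumes a Red Hat-style init script at /etc/init.d/httpd; adjust the restart command and process name for your environment:

    #!/usr/bin/env python
    # Self-healing Apache check (sketch).  Assumes a Red Hat-style init
    # script at /etc/init.d/httpd; adapt the commands for your distro.
    import subprocess, syslog

    def apache_running():
        # pgrep exits 0 when at least one matching process exists
        return subprocess.call(["pgrep", "-x", "httpd"]) == 0

    if not apache_running():
        syslog.syslog(syslog.LOG_WARNING, "httpd not running; attempting restart")
        rc = subprocess.call(["/etc/init.d/httpd", "restart"])
        if rc != 0 or not apache_running():
            # Escalate only when self-healing fails -- this is the actionable alert
            syslog.syslog(syslog.LOG_ERR, "httpd restart failed on this node")

Run from cron every few minutes, a check like this quietly handles the common case and bothers a human only when the restart itself fails.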

When selecting a monitoring tool in a large environment, you have to think about scalability. I have seen both Zabbix and Nagios used to monitor in excess of 1,500 machines and implement tens of thousands of checks. Even with these tools, you might want to scale horizontally by dividing your machines into logical groups and then running a single monitoring server per group. This will increase complexity, but if done correctly, it will also prevent your monitoring infrastructure from going up in flames.
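
One simple way to split that load is to assign each machine to a monitoring server based on a hash of its hostname, so the mapping stays stable as nodes come and go. The server names below are hypothetical, and a real deployment might group by rack, cluster or function instead:

    # Sketch: spread hosts across several monitoring servers by hashing
    # hostnames.  Server names are hypothetical.
    import zlib

    MONITOR_SERVERS = ["mon01.example.com", "mon02.example.com",
                       "mon03.example.com"]

    def monitoring_server(hostname):
        # crc32 gives a cheap, stable hash, so a node always lands on
        # the same monitoring server
        return MONITOR_SERVERS[zlib.crc32(hostname.encode()) % len(MONITOR_SERVERS)]

    print(monitoring_server("node0042.example.com"))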

Configuration Management

In small environments, you can maintain Linux systems successfully without a configuration management tool. This is not the case in large environments. If you plan on running a large number of Linux systems efficiently, I strongly encourage you to consider a configuration management system. There are currently two heavyweights in this area, Cfengine and Puppet. Cfengine is a mature product that has been around for years, and it works well. The new kid on the block is Puppet, a Ruby-based tool that is quickly gaining popularity. Your configuration management tools should, obviously, allow you to add or modify system or application configuration files to a single system or groups of machines. Some examples of files you might want to manage are /etc/fstab, ntpd.conf, httpd.conf or /etc/passwd. Your tool also should be able to manage symlinks and software packages or any other node attributes that change frequently.
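
If you do end up rolling your own before adopting one of these tools, the core operation is small: make each managed file on disk match a master copy and record when anything changes. The following toy sketch illustrates that convergence step with made-up file locations; it is not a substitute for Cfengine or Puppet:

    # Toy convergence step: make each managed file match its master copy.
    # The source and target paths are illustrative only.
    import filecmp, os, shutil, syslog

    MANAGED_FILES = {
        "/srv/config/ntp.conf": "/etc/ntp.conf",
        "/srv/config/fstab":    "/etc/fstab",
    }

    for source, target in MANAGED_FILES.items():
        if not os.path.exists(target) or not filecmp.cmp(source, target, shallow=False):
            shutil.copy2(source, target)
            syslog.syslog(syslog.LOG_INFO, "replaced %s from %s" % (target, source))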

Regardless of which configuration management tool you use, it's important to implement it early. Managing Linux configurations is something that should be set up as the node is being installed. Retrofitting configuration management on a node that is already in production can be a dangerous endeavor. Imagine pushing out an incorrect fstab or password file, and you get an idea of what can go wrong. Despite the obvious hazards of fat-fingering a configuration management tool, the benefits far outweigh the dangers. Configuration management tools provide a highly effective way of managing Linux systems and can reduce system administration overhead dramatically.

As an added bonus, configuration management systems also can be used as a system backup mechanism of sorts. Granted, you don't want to store large amounts of data in a tool like Cfengine, but in the event of system failure, using a configuration management tool in conjunction with your node installation tools should allow you to get the system into a known good state in a minimal amount of time.

Provisioning

Provisioning is the process of installing the operating system on a machine and performing basic system configuration. At home, you probably boot your computer from a DVD to install the latest version of your favorite Linux distro. Can you imagine popping a DVD in and out of a data center full of systems? Not appealing. A more efficient approach is to install the OS over the network, and you typically do this with a combination of PXE and Kickstart. There are numerous tools to assist with large-scale provisioning—Cobbler and Spacewalk are two—but you may prefer to roll your own. Your provisioning tools should be tightly coupled to your configuration management system. The ultimate goal is to be able to sit at your desk, run a couple commands, and see a hundred systems appear on the network a few minutes later, fully configured and ready for production.
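
To make the PXE half of that concrete, here is a rough sketch that writes one pxelinux configuration file per node, each pointing at a Kickstart file served over HTTP. The MAC addresses, kernel names and URLs are placeholders:

    # Sketch: write one pxelinux config per node, pointing at a Kickstart
    # file over HTTP.  MACs, paths and URLs are placeholders.
    import os

    TFTP_DIR = "/var/lib/tftpboot/pxelinux.cfg"
    KICKSTART_URL = "http://install.example.com/ks/worker.cfg"

    NODES = {
        "node001": "00:25:90:aa:bb:01",
        "node002": "00:25:90:aa:bb:02",
    }

    TEMPLATE = ("default linux\n"
                "label linux\n"
                "  kernel vmlinuz\n"
                "  append initrd=initrd.img ks=%s\n")

    for name, mac in NODES.items():
        # pxelinux looks for a file named 01-<mac-with-dashes>
        fname = "01-" + mac.replace(":", "-")
        with open(os.path.join(TFTP_DIR, fname), "w") as f:
            f.write("# generated for %s\n" % name)
            f.write(TEMPLATE % KICKSTART_URL)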

Hardware

When it's time to purchase hardware for your new Linux super cluster, there are many things to consider, especially when it comes to choosing a good vendor. When selecting vendors, be sure to understand their support offerings fully. Will they come on-site to troubleshoot issues, or do they expect you to sit for hours on the phone pulling your hair out while they plod through an endless series of troubleshooting scripts? In my experience, the best, most responsive shops have been local whitebox vendors. It doesn't matter which route you go, large corporate or whitebox vendor, but it's important to form a solid business relationship, because you're going to be interacting with each other on a regular basis.

Old hardware is simply more likely to fail than newer hardware. In my shop, we typically purchase systems with three-year support contracts and then retire the machines in year four. Sometimes we keep machines around longer and simply discard a system if it experiences any type of failure. This is particularly true in tight budget years.

Purchasing the latest, greatest hardware is always tempting, but I suggest buying widely adopted, field-tested systems. Common hardware usually means better Linux community support. When your network card starts flaking out, you're more likely to find a solution to the problem if 100,000 other Linux users also have the same NIC. In recent years, I've been very happy with the Linux compatibility and affordability of Supermicro systems. If your budget allows, consider purchasing a system with hardware RAID and redundant power supplies to minimize the number of after-hours pages. Spare systems or excess hardware capacity are a must for large shops, because regardless of the quality of your hardware, systems will fail.

Backups

Rethink backups. More than likely, when confronted with a large Linux deployment, you're going to be dealing with massive amounts of data. Deciding what data to back up requires careful coordination with stakeholders. Communicate with users so they understand backup limitations. Obviously, written policies are a must, but the occasional e-mail reminder is a good idea as well. As a general rule, you want to back up only absolutely essential data, such as home directories, unless requirements dictate otherwise.
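
For the common “home directories only” policy, even a thin wrapper around rsync goes a long way. The backup host, destination path and exclusion below are examples, not recommendations:

    # Sketch: back up only home directories to a backup host with rsync.
    # The backup host, destination and exclusion are examples.
    import subprocess, sys

    cmd = [
        "rsync", "-a", "--delete",
        "--exclude", "*/scratch/",   # data users have agreed is not backed up
        "/home/",
        "backup01.example.com:/backups/home/",
    ]
    sys.exit(subprocess.call(cmd))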

Serial Console Access

Although it may seem antiquated, do not underestimate the value of serial console access to your Linux systems. When you find yourself in a situation where you can't access a system via SSH or other remote-access protocol, a good old serial console could be a lifesaver, particularly if you manage systems in a remote data center. Equally important is the ability to power-cycle a machine remotely. Absolutely nothing is more frustrating than having to drive to the data center at 3am to push the power button on an unresponsive system.

Many hardware devices exist for power-cycling systems remotely. I've had good luck with Avocent and APC products, but your mileage may vary. Going back to our “keep it simple” mantra, no matter what solution you select, try to standardize on one particular brand if possible. More than likely, you're going to write a wrapper script around your power-cycling utilities, so you can do things like powercycle node.example.com, and having just a single hardware type keeps the implementation more straightforward.
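
The wrapper itself can stay very small when every node maps to a single PDU brand. In the sketch below, the node-to-outlet table and the pdu_powercycle command are hypothetical stand-ins for whatever your vendor's utility actually provides:

    #!/usr/bin/env python
    # "powercycle node.example.com" wrapper (sketch).  The outlet table and
    # the pdu_powercycle command are hypothetical; substitute your PDU tool.
    import subprocess, sys

    OUTLETS = {
        "node001.example.com": ("pdu-rack12.example.com", 7),
        "node002.example.com": ("pdu-rack12.example.com", 8),
    }

    def powercycle(node):
        pdu, outlet = OUTLETS[node]
        return subprocess.call(["pdu_powercycle", pdu, str(outlet)])

    if __name__ == "__main__":
        sys.exit(powercycle(sys.argv[1]))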

System Administrators

No matter how good your tools are, a solid system administration team is essential to managing any large Linux environment effectively. The number of systems managed by my group has grown from about a dozen Linux nodes eight years ago to roughly 4,000 today. We currently operate with an approximate ratio of 500 Linux servers to every one system administrator, and we do this while maintaining a high level of user satisfaction. This simply wouldn't be possible without a skilled group of individuals.

When hiring new team members, I look for Linux professionals, not enthusiasts. What do I mean by that? Many people might view Linux as a hobby or as a source of entertainment, and that's great! But the people on my team see things a little differently. To them, Linux is an awesomely powerful tool—a giant hammer that can be used to solve massive problems. The professionals on my team are curious and always thinking about more efficient ways of doing things. In my opinion, the best large-scale sysadmin is someone who wants to automate any task that needs to be repeated more than once, and someone who constantly thinks about the big picture, not just the single piece of the puzzle that they happen to be working on. Of course, an intimate knowledge of Linux is mandatory, as is a wide range of other computing skills.

In any large Linux shop, there is going to be a certain amount of mundane, low-level work that needs to be performed on a daily basis: rebooting hung systems, replacing failed hard drives and creating new user accounts. The majority of the time, these routine tasks are better suited to your junior admins, but it's beneficial for more senior people to be involved from time to time, as they serve as a fresh set of eyes, potentially identifying areas that can be streamlined or automated entirely. Senior admins should focus on improving system management efficiency, solving difficult issues and mentoring other team members.

Conclusion

We've touched on a few of the areas that make large-scale Linux system administration challenging. Provisioning, configuration management and monitoring are all particularly important, but you still need reliable hardware and great people. Managing a large environment can be nerve-racking at times, but never lose sight of the fact that ultimately, it's just a bunch of Linux boxes.

Jason Allen is CD/SCF/FEF Department Head at Fermi National Accelerator Laboratory, which is managed by Fermi Research Alliance, LLC, under Management and Operating Contract (DE-AC02-07CH11359) with the Department of Energy. He has been working with Linux professionally for the past 12 years and maintains a system administration blog at savvysysadmin.com.