High-Performance Linux Clusters

David Morton

Issue #163, November 2007

The present and future of high-performance computing.

Twice a year, a group of scientists in Europe and the United States releases a list of the world's 500 most powerful computing systems. The Top 500 list is the most prestigious ranking of its kind, with vendors and users leveraging favorable rankings to promote their work. The most recent list, released in June 2007, reconfirmed a recent trend: Linux is by far the most frequently used operating system in high-performance computing (HPC). Consider the numbers: 389 machines (or 78%) run some flavor of Linux, 64 run UNIX, two run Windows and 42 feature a mix of Linux and other operating systems.

Although such dominance suggests that Linux has had a long history in HPC, the truth is that Linux clusters began replacing UNIX systems only six years ago. The quick initial uptake came about because Linux and open systems brought commodity hardware and software into what had previously been a proprietary systems market. This change brought costs down significantly, allowing users at the high end to purchase more power at lower cost and opening the door to new users, such as traditional product designers who could not afford closed proprietary systems. Linux's domination of the HPC market is so complete that market research firm IDC estimated Linux represented 65% of the total HPC market by mid-2006 (compared to approximately 30% for UNIX), with additional growth projected. The Top 500 list confirms that growth.

Challenges and Questions

Linux is clearly the present of HPC, but is it the future? Microsoft continues to make advances with its Windows Compute Cluster Server, has plenty of cash on hand and is clearly capable, from a business perspective, of eating up market share. In addition, despite its well-known flaws, nearly everyone has worked with Windows and is familiar with it, potentially making it a comfortable platform for new HPC users.

Complicating matters further is that, despite their well-earned market dominance, high-performance Linux clusters have, in many cases, earned a reputation for being difficult to build and manage. Widely available commodity components lead to complexity in the selection, integration and testing required when building a stable system. This complexity becomes doubly problematic when you consider that organizations invest in HPC systems in order to get the best possible performance for the applications they run. Small variations in system architecture can have a disproportionately large impact on time to production, system throughput and the price/performance ratio.

Furthermore, like any new technology, the first high-performance Linux clusters hit bumps in the road. Early systems took vendors a long time to build and deliver, and an even longer time to put into production. Additionally, early management software made re-provisioning systems and upgrading components cumbersome. Finally, delivering HPC systems is as much about understanding the nuances of computer-aided engineering (CAE) applications as it is about understanding technical minutiae related to interconnects, processors and operating systems. Early vendors of high-performance Linux clusters did not necessarily have the expertise in computational fluid dynamics (CFD), finite element analysis (FEA) and visualization codes that proprietary systems vendors had built up.

It is, therefore, natural for many to question whether the tremendous price advantage of Linux and open systems still outweighs all other considerations. The truth is that although Windows provides some advantages to entry-level HPC users, high-performance Linux clusters have matured. Today's Linux clusters deliver better performance at a more attractive price than ever before. Clusters are increasingly being demanded as turnkey systems, allowing faster time to production and fewer management headaches. In addition, the very nature of open source has contributed to the strength of high-performance Linux clustering. Linux clusters adapt more quickly to new technology changes, are easier to modify and optimize and benefit from a worldwide community of developers interested in tweaking and optimizing code.

The Advantages of Linux-Based HPC

The most important factor in HPC is, of course, performance. National laboratories and universities want ever-more powerful machines to solve larger problems with greater fidelity. Aerospace and automotive engineering companies want better performing systems in order to grow from running component-level jobs (such as analyzing the stress on an engine block) to conducting more complex, multi-parameter studies. Product designers in a variety of other fields want to graduate from running CAE applications on their relatively slow workstations in order to accelerate the overall design process.

Performance, therefore, cannot be separated from high-performance computing, and in this area Linux clusters excel. There are two primary reasons for this: maturity and community.

Maturity

With years of experience under their belts, vendors and architects of high-performance Linux clusters are better equipped than ever to design stable, tuned systems that deliver the desired price/performance and enable customers to get the most out of their application licenses.

First-generation systems may have been difficult to manage, but the newest generation comes equipped with advanced cluster management software, greatly simplifying operations. Customers who select an experienced vendor receive clusters delivered as full-featured systems rather than as an unwieldy pile of stitched-together commodity components. As a result, users benefit from both lower acquisition costs and easy-to-use high-performance systems.

The maturity of the Linux HPC industry also contributes to a deeper understanding of the codes users rely on, as well as the hardware that goes into building a system. Certain vendors have become experts at tuning systems and optimizing Linux to meet the challenges posed by widely used HPC applications. For example, most high-performance structures codes, such as those from ANSYS or ABAQUS, require high I/O bandwidth to sustain solver throughput. Conversely, crash/impact codes don't require much I/O to run optimally; they are designed to run in parallel, typically on around 16 CPUs. Linux has evolved to the point where it is now easy for vendors to build systems that accommodate the needs of both kinds of codes, even within the same cluster.
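
To make the distinction concrete, the following minimal sketch (C with MPI, not taken from any vendor's or ISV's code) shows the pattern a crash/impact-style solver follows: the work is split across a modest number of MPI ranks, each rank computes on its own slice of the problem, and the only disk I/O is a small summary written by rank 0. The cell count and the per-timestep kernel are hypothetical placeholders.

/*
 * Illustrative sketch of a low-I/O, parallel crash/impact-style run.
 * Build and launch (assuming an MPI toolchain is installed):
 *   mpicc -o decomp decomp.c && mpirun -np 16 ./decomp
 */
#include <mpi.h>
#include <stdio.h>

#define NCELLS 1600000   /* hypothetical global cell count */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns a contiguous slice of the global domain. */
    int ncells_local = NCELLS / nprocs;
    double local_energy = 0.0;

    /* Stand-in for the per-timestep compute kernel. */
    for (int i = 0; i < ncells_local; i++)
        local_energy += 1.0e-6 * i;

    /* Communication, not disk I/O, dominates: reduce to rank 0. */
    double total_energy = 0.0;
    MPI_Reduce(&local_energy, &total_energy, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d total energy=%g\n", nprocs, total_energy);

    MPI_Finalize();
    return 0;
}

A structures code, by contrast, would add heavy scratch-file traffic inside the timestep loop, which is why the two classes of applications place such different demands on a cluster's storage and why tuning for one does not automatically serve the other.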

Alliant Techsystems (ATK) is a recent example of how high-performance Linux clusters have matured. ATK is an advanced weapons and space systems company with many years of experience working with HPC systems. In 2006, faced with upgrading its aging proprietary system, the Launch Systems Group invested, after extensive benchmarking, in a high-performance Linux cluster tuned and optimized for CFD, FEA and visualization codes. The decision reflected the group's understanding that Linux clusters, and the vendors behind them, had matured.

“We had heard several horror stories of organizations that moved to Linux supercomputers, only to suffer through installation times that stretched to six or eight months and beyond”, said Nathan Christensen, Engineering Manager at ATK Launch Systems Group. “For instance, one of ATK's other business units experienced eight weeks of waiting and downtime to get a system into production. The Launch Systems Group wanted to avoid a similar experience.”

“The system arrived application-tuned, validated and ready for production use”, said Christensen. “We were able to move quickly into full production, generating our simulations and conducting our analysis within two weeks of delivery.”

The system also accelerated the company's time to results, enabling ATK to complete designs faster and conduct more frequent, higher-fidelity analysis. The Launch Systems Group completes runs three to four times faster than before. In addition, on some of its key CFD and FEA applications, ATK has been able to achieve ten times the throughput.

Community

The greater Linux community is also an important factor in assuring that Linux-based systems deliver the greatest performance. The benefit of being open source means that users and vendors from around the world continue to develop innovations and share them with the greater community. This enables Linux-based HPC systems to adapt more quickly to new hardware and software technologies. As a result, the ability to take advantage of new processors, interconnects and applications is much greater than with proprietary systems.

Additional Benefits

High-performance Linux clusters offer a range of benefits beyond raw application performance.

First, Linux is well known for its ability to interoperate with all types of architectures and networks. Because of the size of the investment an HPC system represents, users want to make certain that their systems are as future-proof as possible. Linux provides an operating system flexible enough to accommodate virtually any future advancement, and that flexibility is amplified when the larger Linux community, working together to solve common problems, is taken into account. In addition, a variety of tools, such as Samba, allow Linux to share file services with Windows systems, and vice versa.

Second, Linux clusters evolved around headless operation, with compute nodes that need no individual keyboard, mouse or monitor. As a result, administrative tools install and manage the system as a whole, rather than as individual workstations or servers. These tools continue to get easier to use, enabling users with limited technical skills to jump quickly into HPC. To take just one example, Linux Networx recently launched its newest cluster management application, Clusterworx Advanced. This application provides system administrators with intuitive tools that greatly simplify operations and reduce administration workload.

Third, Linux-based clusters are easy to scale, due in part to newer filesystems, such as GPFS and Lustre, which provide better scalability but are available only on Linux and UNIX. Windows-based filesystems are typically tuned for file sharing and don't provide the performance and accessibility required when many compute nodes all request the same dataset at the same time.
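
As an illustration of that access pattern, the sketch below (C with MPI-IO, with a hypothetical file path) has every rank read its own slice of one shared dataset in a single collective call. On a parallel filesystem such as Lustre or GPFS, this kind of concurrent read from many nodes is exactly what the filesystem is designed to service.

/*
 * Illustrative sketch: many ranks read portions of one shared file at once.
 *   mpicc -o shared_read shared_read.c && mpirun -np 64 ./shared_read
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1 << 20;                       /* doubles per rank */
    double *buf = malloc(count * sizeof(double));

    /* The path is a placeholder for a dataset on a parallel filesystem. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/dataset.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Each rank reads its own offset of the same file, collectively. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_read_at_all(fh, offset, buf, count, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    if (rank == 0)
        printf("%d ranks each read %d doubles from one shared file\n",
               nprocs, count);

    free(buf);
    MPI_Finalize();
    return 0;
}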

Fourth, resource management tools, such as Altair's PBS Pro and Platform LSF, keep computing resources allocated, with utilization rates that can exceed 90%. Without proper resource management, systems tend to work only when the engineering team works, thereby limiting overall utilization. With mature resource management tools, such as those available for Linux-based HPC systems, jobs can be scheduled 24 hours a day, 365 days a year, and multiple jobs can run simultaneously as needed, ensuring excess capacity is always put to use.

Fifth, from a stability perspective, Linux, thanks to its flexibility and the number of people working on refining it, is significantly more stable and scalable than other platforms. Windows, for instance, is prone to failure at moderately large node counts and is not considered an option at government and national laboratories.

Sixth, the nature of open source makes Linux the most convenient platform for vendors and users to work with. Standards are broadly defined and supported by a worldwide community of programmers, rather than by the diminishing numbers found at the remaining proprietary vendors. As a result, there is no shortage of fully developed tools, utilities and software modifications that users and vendors can leverage to optimize their systems.

Conclusion

The HPC market has made its choice, and the choice is the Linux OS, due to its superior performance, lower costs and its community of open-source developers and vendors. Windows may have a lot to offer entry-level users, especially those with more limited resources or goals. Likewise, UNIX still has a lot to offer for many legacy HPC applications. However, both Windows and UNIX require more work to deliver the same functionality and compelling price/performance that Linux provides. The HPC market is more open and competitive than it has ever been, but it is clear that Linux is still the best choice for today and the foreseeable future.

David Morton brings 17 years' experience in supercomputing in both vendor and end-user roles to Linux Networx. Dave is responsible for leading the Linux Networx technology vision and directing the hardware, software and system engineering teams. Previously, Dave served as technical director for the Maui High Performance Computer Center where he was responsible for the definition and oversight of technology for this Department of Defense supercomputing center. He held several positions at SGI, including director of server I/O and continuation engineering and director of Origin platform engineering. Dave's experience includes eight years at Cray Research, where he was named inventor on three issued patents. Dave earned a Master's degree in Mechanical Engineering from the University of Illinois and an MBA from the University of Minnesota.