Linux and the Alpha

David Mosberger

Issue #42, October 1997

This is the first of a two-part series: an introduction to the Alpha family of computers, in preparation for Part 2, which presents techniques for optimizing code on this high-performance platform.

Ever since its announcement in the Fall of 1991, the Alpha architecture (see Reference 3) has been the foundation of the world's fastest systems. In fact, except for one or two brief blips, Alpha systems have been the highest-performing systems based on single-CPU SPECmark performance. With this outstanding performance record comes marketing hype and, sometimes, unrealistic expectations. It is not all that uncommon to find e-mail messages or USENET articles saying things like: “I heard the Alpha is so fast, but now I find that my dusty deck is just 10% faster on the Alpha than on the other system.” So what's the truth? The honest answer is that it depends on what you're doing. Alpha systems are without a doubt fast machines, but it is unreasonable to expect that taking a dusty deck and running it on an Alpha will result in the best possible performance. This is particularly true for programs that were written with the mind-set of the eighties, when CPU cycles were at a premium and memory bandwidth was abundant. Reality looks quite different today: CPU clock-rates above 150MHz are the rule and even laptops can run at 200MHz or more. The result is that, today, the memory system—and not the CPU—is often the first-order bottleneck.

In part 2 of this article, we will demonstrate a few simple techniques that help avoid the memory-system bottleneck. Except for one case, the focus is on integer-intensive applications. The topic of optimizing floating-point intensive applications is certainly just as important but, unfortunately, well beyond the scope of this series. The techniques presented can result in tremendous performance improvements. While they are helpful on all modern systems, they normally yield the biggest benefits on Alpha-based machines. There are a couple of reasons for this bias.

One, the Alpha architecture has been designed with longevity in mind. Specifically, the Alpha architecture should be good for the next 15-25 years, which corresponds roughly to a 1000-fold increase in overall performance. For this reason, some design-tradeoffs were made in favor of long-term viability rather than short-term benefits. For example, the Alpha was right from the start a 64-bit architecture, even though, at the time of its announcement, 32-bit address spaces were considered comfortably large.

Two, the current Alpha implementations are designed to achieve high performance by pushing the clock frequency to the limit. This means the CPU-to-memory-system performance gap is largest on Alpha-based systems. For example, suppose a memory access takes 100ns. On a 500MHz Alpha CPU, this corresponds to 50 clock cycles; on a 250MHz CPU, it is only 25 cycles. So the relative penalty of accessing memory is much higher on a CPU with a higher clock speed. This may sound like a bad thing, but since the absolute memory-access time is the same, what it really means is that a fast-clock system running a memory-bound application will be about as fast as a slower-clock system, but when running an application that uses memory wisely (one that is not memory-bound), it will be much faster.

In this part of the series, I present a brief overview of existing and upcoming Alpha implementations. While it is not usually necessary to optimize for a specific CPU, it is helpful to know what the characteristics of current CPUs and systems are. I also discuss a couple of simple performance analysis tools that are available under Linux. When porting legacy code to modern systems, such tools are invaluable, since they avoid wasting time trying to optimize rarely executed code.

Overview of Alpha Family

So far, the Alpha CPU family tree spans three generations; it all began with the 21064 chip. At the time of its introduction, it was the highest-performing CPU, and it still makes for a nice workstation, though it's no longer competitive with the latest-generation CPUs. This chip branched off into a version called the “Low-Cost Alpha” (LCA), also known as the 21066 or 21068. The chip core was identical to the 21064, but it had an integrated memory and PCI-bus controller. This high level of integration made it possible to build Alpha-based systems at relatively low cost and opened the door to the embedded-systems market. Unfortunately, the design had a major weakness: the memory system was seriously under-powered. This created the paradoxical situation in which a system based on this chip performed, on some applications, no better on average than a 100MHz Pentium, while on others it outperformed a P6 running at 200MHz. As a result, the reaction to this chip varied greatly and probably resulted in quite a few disappointed customers for Digital. On the other hand, there is no doubt that the low cost at which 21066-based systems eventually were sold caused a quantum leap in the number of Linux/Alpha users.

Around June 1994, the 21164 chip was announced. It had dramatically improved performance over the 21064 and was the first, and so far only, Alpha CPU to feature a three-level cache hierarchy. The first and second-level caches were both on-chip and only the third-level cache was on the motherboard. This chip, in slightly improved versions, is still going strong. At the Fall 1996 Comdex in Las Vegas, such a chip, coupled with a liquid cooling system, was demonstrated running at 767MHz. Another version, called 21164PC, is scheduled to become available around Spring 1997. It omits the relatively expensive second-level, on-chip cache but adds multi-media extensions and other performance-enhancing features. As the name indicates, this chip is designed to be price-competitive with PC processors, specifically the forthcoming Intel Klamath (an improved P6). While price-competitive, the 21164PC is supposed to deliver over 50% better performance than the Klamath. For this second-generation, low-cost Alpha implementation, it certainly looks like Digital and its co-designer Mitsubishi are not going to repeat the mistakes of the past. The 21164PC promises to be cheap and fast.

If you happen to have deep pockets or want to take a glance at what PC processors might look like in two or three years, the 21264 might be of interest. It is scheduled to become available in high-end machines during the second half of 1997. With this chip, CPU performance is expected to take another giant leap. Current estimates call for a performance level three to four times that of the fastest CPUs available today.

Between each major chip generation, there are typically “half-generation” CPUs which have improvements that derive primarily from a shrink of the chip manufacturing process. For example, the 21064 chip was followed by the 21064A, and similarly, the 21164 was followed by the 21164A. In the former case, the core of the chip remained virtually identical to the 21064, but the primary caches doubled in size from 8KB to 16KB. In the latter case, instructions for byte and word accesses were added and the maximum clock frequency increased from 333 to 500MHz.

A summary of the performance attributes of the current Alpha chip family is presented in Table 1.

Table 1. Summary of Alpha Chip Family

Linux Performance Analysis Tools

The state of Linux performance analysis tools is rather dire (this is true of freeware in general, not just Linux). Commercial products currently have an edge in this area. For example, Digital UNIX comes with an excellent tool (or rather, tool generator) called ATOM (see Reference 2). ATOM is basically a tool that can rewrite any executable; while rewriting, it can add arbitrary instrumentation code to each function or basic block. Digital UNIX ships with a number of tools built with ATOM: 3rd degree (a memory-leak and bounds checker like the well-known Purify) and several tools that give very detailed information on the performance behavior of a program (such as cache-miss frequency, issue rates and so on). At present, the freeware community can only dream of such versatile tools.

While bleak, the situation is by no means hopeless. The few tools that are available make for powerful companions when properly used. Even good old GNU gprof has a few features of which you may not be aware (more on this later). Let's start with the most basic performance measurement—time.

Accurately Measuring Time

The Unix way of measuring time is by calling gettimeofday(). This returns the current real time at a resolution of typically one timer tick (about 1ms on the Alpha). The advantage of this function is that it's completely portable across all Linux platforms. The disadvantage is its relatively poor resolution (1ms corresponds to 500,000 CPU cycles on a 500MHz CPU) and, more severely, it involves a system call. A system call is relatively slow and has the tendency to mess up your memory system. For example, the cache gets loaded with kernel code so that when your program resumes execution, it sees many cache misses that it wouldn't see without the call to gettimeofday(). This is all right for measuring times on the order of seconds or minutes, but for finer-grained measurements, something better is needed.
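For such coarse-grained measurements, the usual pattern looks something like the sketch below. The loop being timed and the elapsed_seconds() helper are made up for illustration; only the gettimeofday() calls themselves are the point.

#include <stdio.h>
#include <sys/time.h>

/* illustrative helper: difference between two struct timeval
   values, expressed in seconds */
static double elapsed_seconds (struct timeval start, struct timeval stop)
{
    return (stop.tv_sec - start.tv_sec)
         + (stop.tv_usec - start.tv_usec) / 1e6;
}

int main (void)
{
    struct timeval start, stop;
    volatile double x = 0.0;    /* volatile so the loop isn't optimized away */
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < 10000000; ++i)    /* some work to be timed */
        x += 1.0;
    gettimeofday(&stop, NULL);

    printf("elapsed: %.6f seconds (x=%g)\n",
           elapsed_seconds(start, stop), x);
    return 0;
}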

Fortunately, most modern CPUs provide a register that is incremented either at the clock frequency of the CPU or at an integer fraction thereof. The Alpha architecture provides the rpcc (read processor cycle count) instruction. It gives access to a 64-bit register that contains a 32-bit counter in the lower half of the register. This counter is incremented once every N clock cycles. All current chips use N = 1, so the register gets incremented at the full clock frequency. (There may be future Alpha processors where N > 1.) The top half of the value returned by rpcc is operating-system dependent. Linux and Digital UNIX return a correction value that makes it easy to implement a cycle counter that runs only when the calling process is executing (i.e., this allows you to measure the process's virtual cycle count). With gcc, it's very easy to write inline functions that provide access to the cycle counters. For example:

#include <sys/types.h>    /* for the u_int and u_long typedefs */

static inline u_int realcc (void) {
    u_long cc;
    /* read the 64-bit process cycle counter into variable cc: */
    asm volatile("rpcc %0" : "=r"(cc) : : "memory");
    return (u_int) cc;    /* return the lower 32 bits */
}

static inline u_int virtcc (void) {
    u_long cc;
    asm volatile("rpcc %0" : "=r"(cc) : : "memory");
    /* add the process offset (upper half) to the count (lower half): */
    return (cc + (cc << 32)) >> 32;
}

With this code in place, function realcc() returns the 32-bit real-time cycle count, whereas function virtcc() returns the 32-bit virtual cycle count (which is like the real-time count except that counting stops when the process isn't running).

Calling these functions involves very little overhead. The slowdown is on the order of 1-2 cycles per call, and each call adds only one or two instructions (which is less than the overhead of a regular function call). A good way of using these functions is to create an execution-time histogram. For example, the function below measures the individual execution times of ten calls to sqrt(2.0) and prints the results to standard output (as usual, care must be taken to ensure that the compiler doesn't optimize away the actual computation). Printing the individual execution times makes it easy to create a histogram with a little post-processing.

#include <math.h>
#include <stdio.h>

void measure_sqrt (void) {
    u_int start, stop, time[10];
    volatile double x = 2.0;    /* volatile, so the computation below
                                   cannot be optimized away */
    volatile double r;
    int i;

    for (i = 0; i < 10; ++i) {
        start = realcc();
        r = sqrt(x);
        stop = realcc();
        time[i] = stop - start;
    }
    for (i = 0; i < 10; ++i)
        printf(" %u", time[i]);
    printf("\n");
}

Note that the results are printed in a separate loop; this is important since printf is a rather big and complicated function that may even result in a system call or two. If printf were part of the main loop, the results would be much less reliable. A sample run of the above code might produce output like this:

120  101  101  101  101  101  101  101  101  101

Since this output was obtained on a 333MHz Alpha, 120 cycles corresponds to roughly 360ns and 101 cycles to roughly 303ns. The output shows nicely how the first call is quite a bit slower, since the memory system (the instruction cache in particular) is cold at that point. Since the square root function is small enough to fit easily in the first-level instruction cache, all but the first call execute in exactly the same amount of time.

You may wonder why the above code uses realcc() instead of virtcc(). The reason for this is simple—we want to know the results that were affected by a context switch. By using realcc(), a call that suffers a context switch will be much slower than any of the other calls. This makes it easy to identify and discard such unwanted outlying statistics.
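The “little post-processing” mentioned earlier is all it takes to turn the printed cycle counts into a histogram in which such outliers stand out. A small filter along the following lines would do; it is a sketch, not part of the original article, and the 4096-entry limit is an arbitrary assumption:

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    static unsigned long value[4096], count[4096];
    unsigned long v;
    size_t i, n = 0;

    /* read whitespace-separated cycle counts from standard input
       and tally how often each distinct value occurs: */
    while (scanf("%lu", &v) == 1) {
        for (i = 0; i < n && value[i] != v; ++i)
            ;
        if (i == n) {    /* first occurrence of this value */
            if (n == 4096) {
                fprintf(stderr, "too many distinct values\n");
                return EXIT_FAILURE;
            }
            value[n++] = v;
        }
        ++count[i];
    }
    for (i = 0; i < n; ++i)
        printf("%10lu cycles: %lu times\n", value[i], count[i]);
    return 0;
}

Piping the output of measure_sqrt() through this filter prints each distinct cycle count together with the number of times it occurred; a call that was interrupted by a context switch shows up as a value that occurs only once and is far larger than the rest.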

The cycle counter provides a very low-overhead method of measuring individual clock cycles. On the downside, it cannot measure very long intervals. On an Alpha chip running at 500MHz, a 32-bit cycle counter overflows after just eight and a half seconds. This is not normally a problem when making fine-grained measurements, but it is important to keep the limit in mind.

Performance Counters

The Alpha chips, like most other modern CPUs, provide a variety of performance counters. These allow measuring various event counts or rates, such as the number of cache misses, instruction issue rate, branch mispredicts or instruction frequency. Unfortunately, I am not aware of any Linux API that would provide access to these counters. This is particularly unfortunate since both the Pentium and the Pentium Pro chips provide similar counters. Digital UNIX gives access to these counters via the uprofile and kprofile programs, and an ioctl-based interface documented in the pfm(7) man page. Hopefully, something similar (but more general) will eventually become available for Linux. With the proper tools, these counters can provide a wealth of information.

GNU gprof

Most readers are probably familiar with the original gprof (see Reference 3). It's a handy tool to determine the primary performance bottlenecks at the function level. However, with the help of gcc, GNU gprof can also look inside a function. We illustrate this with a truly trivial function that computes the factorial. Assume we've typed up the factorial function and a simple test program in file fact.c. We can then compile that program like this (assuming GNU libc version 2.0 or later is installed):

gcc -g -O -a fact.c -lc

Invoking the resulting a.out binary once produces a gmon.out file that contains the execution counts for each basic block in the program. We can look at these counts by invoking gprof, specifying the -l and --annotate options. This command generates a source-code listing that shows how many times a basic block in each line of source code has been executed.

Listing 1. Basic-block Execution Counts

Our factorial example results in the listing shown in Listing 1. The basic block starting at the printf line in function main() was executed once, so it has been annotated with a 1. For the factorial function, the function prologue and epilogue were executed 20 times each, so the first and last line of function fact are annotated with 20. Of these 20 calls, 19 resulted in a recursive call to fact, and the remaining call simply returned 1. Correspondingly, the then branch of the if statement has been annotated with 19, whereas the else statement has an annotation of 1. It's that simple.
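For reference, a fact.c consistent with these counts could look like the sketch below; the article's actual listing is not reproduced here, so the names and details are assumptions:

#include <stdio.h>

/* recursive factorial; long is 64 bits on the Alpha, so fact(20)
   does not overflow */
long fact (int n)
{
    if (n > 1)
        return n * fact(n - 1);    /* the "then" branch: taken 19 times */
    else
        return 1;                  /* the "else" branch: taken once */
}

int main (void)
{
    printf("fact(20) = %ld\n", fact(20));
    return 0;
}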

There certainly are no surprises in the behavior of function fact(), but in realistic, more complicated functions or in code that was written by somebody else, this knowledge can be very helpful to avoid wasting time optimizing rarely executed code.

Optimization Techniques: Making Your Applications Fly

Next month, we'll look into the techniques that can greatly improve the performance of a given piece of code. Most of them are not novel. Some of them have been around for so long that it would be difficult, if not impossible, to give proper credit. Others are “obvious” (once you know them). The key point is that it is the characteristics of modern, particularly Alpha-based, systems which make these techniques so important and worthwhile.

Acknowledgments

The author would like to thank Richard Henderson of Texas A&M University and Erik Troan of Red Hat Software for reviewing this paper on short notice. Their feedback greatly improved its quality. Errors and omissions are the sole responsibility of the author.

David is a graduate student in the Ph.D. program of the Computer Science department at the University of Arizona. He plans on graduating in August 1997 and finally getting a “Real Job”. After a short intermezzo with Linux involving Reed-Solomon codes and the floppy-tape driver, he forgot about Linux until the need arose for a low-cost, high-performance system based on Digital's Alpha processor. As a result, he got involved in the Linux/Alpha port and has been sticking around in the free software community ever since. When not playing with computers, he enjoys the outdoors with his lovely wife. He can be reached via e-mail at David.Mosberger@acm.org.