Garbage Collection in C Programs

Gianluca Insolvibile

Issue #113, September 2003

LISP and Java programmers take garbage collection for granted. With the Boehm-Demers-Weiser library, you easily can use it in C and C++ projects, too.

The first word that came to mind when I heard about introducing Garbage Collection techniques into a C or C++ program was “nonsense”. As with any other decent C programmer who loves this language, the thought of leaving the management of my own memory to others seemed nearly offensive. I had a similar feeling 15 years ago, when I first heard about compilers that would generate assembly code on my behalf. I was used to writing my code directly in 6510 opcodes, but that was Commodore 64—and a totally different story.

What Is Garbage Collection?

Garbage Collection (GC) is a mechanism that provides automatic memory reclamation for unused memory blocks. Programmers dynamically allocate memory, but when a block is no longer needed, they do not have to return it to the system explicitly with a free() call. The GC engine takes care of recognizing that a particular block of allocated memory (heap) is not used anymore and puts it back into the free memory area. GC was introduced by John McCarthy in 1958, as the memory management mechanism of the LISP language. Since then, GC algorithms have evolved and now can compete with explicit memory management. Several languages are natively based on GC. Java probably is the most popular one, and others include LISP, Scheme, Smalltalk, Perl and Python. C and C++, in the tradition of a respectable, low-level approach to system resources management, are the most notable exceptions to this list.

Many different approaches to garbage collection exist, resulting in some families of algorithms that include reference counting, mark and sweep and copying GCs. Hybrid algorithms, as well as generational and conservative variants, complete the picture. Choosing a particular GC algorithm usually is not a programmer's task, as the memory management system is imposed by the adopted programming language. An exception to this rule is the Boehm-Demers-Weiser (BDW) GC library, a popular package that allows C and C++ programmers to include automatic memory management into their programs. The question is: Why would they want to do a thing like this?

The Boehm-Demers-Weiser GC Library

The BDW library is a freely available library that provides C and C++ programs with garbage collection capabilities. The algorithm it employs belongs to the family of mark and sweep collectors, where GC is split into two phases. First, a scan of all the live memory is done in order to mark unused blocks. Then, a sweep phase takes care of putting the marked blocks in the free blocks list. The two phases can be, and usually are, performed separately to increase the general response time of the library. The BDW algorithm also is generational; it concentrates free space searches on newer blocks. This is based on the idea that older blocks statistically live longer. To put it another way, most allocated blocks have short lifetimes. Finally, the BDW algorithm is conservative in that it needs to make assumptions on which variables are actually pointers to dynamic data and which ones only look that way. This is a consequence of C and C++ being weakly typed languages.

The BDW collector comes as a static or dynamic library and is installed easily by downloading the corresponding package (see Resources) and running the traditional configure, make and make install commands. Some Linux distributions also come with an already-made package. For example, with Gentoo you need to type only emerge boehm-gc to install it. The installed files include both a shared object (libgc.o) and a static library (libgc.a).

Using the library is a fairly straightforward task; for newly developed programs, you simply call GC_alloc() to get memory and then forget about it when you do not need it anymore. “Forget about it” means setting all the pointers that reference it to NULL. For already existing sources, substitute all allocation calls (malloc, calloc, realloc) with the GC-endowed ones. All free() calls are replaced with nothing at all, but do set any relevant pointers to NULL.

GC_alloc() actually sets the allocated memory to zero to minimize the risk that preexisting values are misinterpreted as valid pointers by the GC engine. Hence, GC_alloc() behaves more like calloc() than malloc().

Using GC in Existing C Programs

If you want to try GC in an existing application, manually editing the source code to change mallocs and frees is not necessary. In order to redirect those calls to the GC version, you basically have three options: using a macro, modifying the malloc hooks and overriding glibc's malloc() with libgc's malloc(). The first approach is the easiest one; you simply need to insert something like:


#define malloc(x) GC_malloc(x)
#define calloc(n,x) GC_malloc((n)*(x))
#define realloc(p,x) GC_realloc((p),(x))
#define free(x) (x) = NULL

into the appropriate include files. This code substitutes only the explicit calls contained in your code, leaving startup and library allocations to traditional malloc/free calls.

A different approach is to hook malloc and friends to functions of your own, which in turn would call the GC versions. Listing 1 shows how to do it, and it can be linked directly to an existing program. See my article “Advanced Memory Allocation” [LJ, May 2003] for details on these hooks. With this method, any heap allocation is guaranteed to go through libgc, even if it is not performed directly by your code.

As a third alternative, you can pass --enable-redirect-malloc to configure before compiling the libgc library. Doing so provides the library with wrapper functions that have the same names as the standard glibc malloc family. When linking with your code, the functions in libgc override the standard ones, with a net effect similar to using malloc hooks. In this case, though, the effect is system-wise, as any program linked with libgc is affected by the change.

Do not expect to endow huge programs with GC easily using any of these methods. Some simple tricks are needed in order to exploit GC functions and help the collector algorithm work efficiently. For example, I tried to recompile gawk (version 3.1.1) using GC and obtained an executable ten times slower than the original. With some adjustments, such as setting each pointer to NULL after having freed it, the execution time improved significantly, even if it was still greater than the original time.

Garbage Collection in New Programs

If you are developing a new program and would like to take advantage of automated memory management, all you need to do is use the GC_malloc() family in place of the plain malloc() one and link with libgc. Memory blocks no longer needed simply can be disposed of by setting any referencing pointers to NULL. Alternatively, you can call GC_free() to free the block immediately.

Always remember that your whole heap is scanned periodically by the collector to look for unused blocks. If the heap is large, this operation may take some time, causing the performance of the program to degrade. This behavior is suboptimal, because large blocks of memory often are guaranteed never to contain pointers, including buffers used for file or network I/O and large strings. Typically, pointers are contained only in fixed positions within small data structures, such as list and tree nodes. Were C and C++ strongly typed languages, the collector could have decided whether to scan a memory block, based on the type of pointer. Unfortunately, this is not possible because it is perfectly legal in C to have a char pointer reference a list node.

For optimal performance, the programmer should try to provide some basic runtime type information to the collector. To this end, the BDW library has a set of alternative functions that can be used to allocate memory. GC_malloc_atomic() can be used in place of GC_malloc() to obtain memory blocks that will never contain valid pointers. That is, the collector skips those blocks when looking for live memory references. Furthermore, those blocks do not need to be cleared on allocation. GC_malloc_uncollectable() and GC_malloc_stubborn() also can be used to allocate fixed and rarely changing blocks, respectively. Finally, it is possible to provide some rough type information by using GC_malloc_explicitly_typed() and building block maps with GC_make_descriptor(). See gc_typed.h on the Linux Journal FTP site for more information [available at ftp.linuxjournal.com/pub/lj/listings/issue113/6679.tgz].

The collector's behavior also can be controlled by the user through a number of function calls and variables. Among the most useful ones are GC_gcollect(), which forces a full garbage collection on the whole heap; GC_enable_incremental(), which enables incremental mode collection; and GC_free_space_divisor, which tunes the trade-off between frequent collections (high values, causing low heap expansion and high CPU overhead) and time efficiency (low values).

Heap status and debug information is available through a number of functions, including GC_get_heap_size(), GC_get_free_bytes(), GC_get_bytes_since_gc(), GC_get_total_bytes() and GC_dump(). Many of these parameters and functions are not documented at all, not even in the source code itself. As always, a good editor is your friend.

Performance and Behavior

A single best approach to memory management that is effective for any program does not exist. Given a specific application, the optimal solution must be found by compromising on a number of different factors, including CPU overhead, heap expansion, allocation latency and, last but not least, manageability and robustness of the code. Profiling the program and testing different memory strategies probably is the best solution for dealing with these issues.

Garbage Collection and Compiler Optimizations

One subtle point against GC is it requires extra care if compiler optimizations are switched on. The collector may wrongly assume a certain pointer has disappeared simply because references to it have been optimized out. Thus, the corresponding memory block might be freed even if it still is in use by the program, with the obvious consequences. Hence, it might be tempting to turn compiler optimizations off to be safe, losing part of the performance gain obtained by using GC.

In order to have an idea of the behavior and performance of GC vs. traditional memory management, we have experimented with a test program, gctest, which loops over the creation and destruction of a simple list. Simple as it may seem, the test raises some interesting points. The source code is not really instructive and is too long to be printed, so it is available for download on the FTP site (ftp.linuxjournal.com/pub/lj/listings/issue113/6679.tgz).

gctest can be controlled with a number of options that allow you to experiment with different list lengths and node sizes, as well as to change working parameters—enabling and disabling specific features of the GC library. Before we comment on the results we obtained with this test tool, it is important to point out that they were obtained in an artificial and not excessively realistic environment. So, again, you are invited to test the GC library yourself and evaluate it for your own code. Take the parameters presented here as possible indicators of the library suitability to a particular application.

The measurements we collected are overall execution time, the CPU load; heap expansion, how much memory is requested by the system with respect to how much actually was allocated by the program; and allocation latency average and standard deviation, how long it takes to allocate a single block of memory and how much this time varies across different allocations. The meaning of the first parameter is quite obvious and needs no further explanation. Heap expansion is a measure of how much of the allocated memory is fragmented and what amount of extra memory is requested by the library from the operating system. As we will see, the library may allocate a 1MB block ten times, freeing it after each allocation and requesting a total of 10MB of the system, as if freed memory was not put back on the free list. Although this sometimes is desired behavior, needed to optimize the allocation strategy, it can be annoying on systems with a limited availability of RAM. It can become a further source of CPU overhead if swap space is involved. Finally, allocation latency is important for real-time applications, which need the longest allocation time to be bounded. Typical cases are multimedia playback applications and specialized embedded systems that need to react to external events in a predictable amount of time.

Some Comments on the Results

Our test box A is a Pentium 4 2.53GHz system, with 1GB of RAM running Gentoo Linux (all code is optimized for the CPU architecture) and glibc 2.3, which has an improved memory management algorithm over glibc 2.2. Test box B is a K6-II 400MHz laptop with 128MB of RAM, running Slackware Linux with glibc 2.2.

Our first test consisted of allocating a list of 150,000 nodes, 16 bytes each, 30 times. On each loop, the allocated list was destroyed, that is, free()ed in case of traditional management, unlinked in case of GC. The test commands were:


gctest -tu -s 4 -n 150000 -l 30

and:


gctest -gu -s 4 -n 150000 -l 30

The overall execution time, on box A, was 3.80 seconds with traditional management and 2.43 seconds with GC, an improvement of about 35%. The same test performed on box B showed an even greater improvement, around 40%. This first test shows that, contrary to popular belief, GC actually can be quite faster than malloc/free. Heap expansion is rather limited and amounts to about 2 in both cases. What is even more surprising is that allocation latency is the same—6.7 microseconds—with a slightly larger deviation for the GC case. Also interesting is that by calling GC_gcollect() at each loop (option -G), the overall execution time decreases by 0.1 second. This result is counterintuitive, because we have one more function call in the loop.

Now, let's see what happens if we forget to destroy the list at the end of each loop. In the traditional management case, the test executes faster, 2.58 vs. 3.80 seconds, but the peak heap expansion is 140MB, which is twice the overall allocated memory. In the GC case, the test aborts due to memory exhaustion unless we call for an explicit collection (-G) at the end of each loop. By doing so, we obtain the lowest execution time, 2.32 seconds. This probably is quite far from what we could have imagined a priori—that's why actual experimentation is important for finding the optimal solutions.

The same test also has been performed on box B, but with an unoptimized Slackware distribution and glibc 2.2. It is interesting that although the improvement of GC over malloc/free still was around 40%, the test ran 27% faster under Gentoo.

The second test we made shows some limitations of the GC library. The test conditions actually were quite extreme: we tried to allocate five lists, each having 1,500,000 nodes, with each node being 16 bytes long. Although the malloc/free version ran correctly, the GC version did not complete the test because of memory exhaustion. The problem probably is due to the large number of blocks consecutively allocated.

The third test used larger nodes, 140 bytes each, and a shorter list length, 150,000 nodes. We ran:


gctest -tu -s 128 -n 150000 -l 5

and:


gctest -gu -s 128 -n 150000 -l 5

Under these conditions, the improvement of GC over malloc/free was around 20%, 0.85 vs. 0.60 seconds. Recall, though, the GC library always clears the allocated blocks, but malloc() does not. The overhead associated with this operation grows linearly with node size and thus is more important with larger nodes. To make a fair comparison, then, we need to substitute malloc() with calloc() at those points where GC_malloc() is called, as happens in:


gctest -tuc -s 128 -n 150000 -l 5

This test yields an execution time of 0.88 seconds and brings the GC improvement to 32%. Heap expansion is greater in the GC case, with a value of 1.7 vs. 1.0. Allocation latency is practically the same for both traditional and GC management, although a larger latency variation was experienced in the latter. Enabling incremental collection (-i option) did not lower the variation, although introducing calls to GC_free() (-F option) to explicitly free the list nodes explicitly actually did yield better results than the malloc/free case, on both execution time and latency. However, in this case we are not strictly using a real garbage collection approach.

Testing even larger memory blocks makes the difference between traditional and GC memory management quite noticeable. On 4KB nodes, GC performed quite poorly in comparison with malloc/free on execution time, 0.85 vs. 2 seconds; heap expansion, 2.75 vs. 1; and latency, 0.7 milliseconds vs. 1.6 microseconds. When compared with calloc/free performance, the execution time of GC still is quite competitive (40% faster). But, issues related to heap and latency remain.

Conclusion

GC techniques often are surrounded by myths and legends. In this article we have shown that GC actually can perform better than malloc/free. These advantages do not come for free, however, and the correct use of the library requires minimum knowledge on its internal mechanisms.

There is no final verdict on the suitability of GC for C programs. For now, the BDW library can be one more tool in your box, to be given serious consideration the next time you deal with a complex piece of software. Several open-source projects, like the Mono Project and GNU gcj Java runtime, have been using it for a while.

Finally, the BDW library also can be used proficiently as a leak detector; the Mozilla team uses it this way, for example. For more information, have a look at the included documentation.

Gianluca Insolvibile has been a Linux enthusiast since kernel 0.99pl4. He currently deals with networking and digital video research and development. He can be reached at g.insolvibile@cpr.it.