Compiling Java with GCJ

Per Bothner

Issue #105, January 2003

Although Java isn't a popular choice for free projects, GJC can make it a viable option.

Java has not become as pervasive as the original hype suggested, but it is a popular language, used a lot for in-house and server-side development and other applications. Java has less mind-share in the free software world, although many projects are now using it. Examples of free projects using Java include Jakarta from the Apache Foundation (jakarta.apache.org), various XML tools from W3C (www.w3.org) and Freenet (freenet.sourceforge.net). See also the FSF's Java page (www.gnu.org/software/java).

One reason relatively few projects use Java has been the real or perceived lack of quality, free implementations of Java. Two free Java implementations, however, have been around since the early days of Java. One is Kaffe (www.kaffe.org), originally written by Tim Wilkinson and still developed by the company he cofounded, Transvirtual. The other is GCJ (the GNU Compiler for the Java language), which I started in 1996 at Cygnus Solutions (and which this article discusses). GCJ has been fully integrated and supported as a GCC language since GCC version 3.0.

Traditional Java Implementation Model

The traditional way to implement Java is a two-step process: a translation phase and an execution phase. (In this respect Java is like C.) A Java program is compiled by javac, which produces one or more files with the extension .class. Each such file is the binary representation of the information in a single class, including the expressions and statements of the class' methods. All of these have been translated into bytecode, which is basically the instruction set for a virtual, stack-based computer. (Because some chips also have a Java bytecode instruction set, it also can be a real instruction set.)

The execution phase is handled by a Java Virtual Machine (JVM) that reads in and executes the .class files. Sun's version is called plain “java”. Think of the JVM as a simulator for a machine whose instruction set is Java bytecodes.

Using an interpreter (simulator) adds quite a bit of execution overhead. A common solution for high-performance JVMs is to use dynamic translation or just-in-time (JIT) compilers. In that case, the runtime system will notice a method has been called enough times to make it worthwhile to generate machine code for that method on the fly. Future calls to the method will execute the machine code directly.

A problem with JITs is startup overhead. It takes time to compile a method, especially if you want to do any optimization, and this compilation is done each time the application is run. If you decide to compile only the methods most often executed, then you have the overhead of measuring those. Another problem is that a good JIT is complex and takes up a fair bit of space (plus the generated code needs space, which may be on top of the space used by the original bytecode). Little of this space can be in shared memory.

Traditional Java implementation techniques also do not interoperate well with other languages. Applications are deployed differently (a Java Archive .jar file, rather than an executable); they require a big runtime system, and calling between Java and C/C++ is slow and inconvenient.

The GCJ Solution: Ahead-of-Time Compilation

The approach of the GCJ Project is radically traditional. We view Java as simply another programming language and implement it the way we implement other compiled languages. As Cygnus had been long involved with GCC, which was already being used to compile a number of different programming languages (C, C++, Pascal, Ada, Modula2, Fortran, Chill), it made sense to think about compiling Java to native code using GCC.

On the whole, compiling a Java program is actually much simpler than compiling a C++ program, because Java has no templates and no preprocessor. The type system, object model and exception-handling model are also simpler. In order to compile a Java program, the program basically is represented as an abstract syntax tree, using the same data structure GCC uses for all of its languages. For each Java construct, we use the same internal representation as the equivalent C++ would use, and GCC takes care of the rest.

GCJ can then make use of all the optimizations and tools already built for the GNU tools. Examples of optimizations are common sub-expression elimination, strength reduction, loop optimization and register allocation. Additionally, GCJ can do more sophisticated and time-consuming optimizations than a just-in-time compiler can. Some people argue, however, that a JIT can do more tailored and adaptive optimizations (for example, change the code depending on actual execution). In fact, Sun's HotSpot technology is based on this premise, and it certainly does an impressive job. Truthfully, running a program compiled by GCJ is not always noticeably faster than running it on a JIT-based Java implementation; sometimes it even may be slower, but that usually is because we have not had time to implement Java-specific optimizations and tuning in GCJ, rather than any inherent advantage of HotSpot technology. GCJ is often significantly faster than alternative JVMs, and it is getting faster as people improve it.

A big advantage of GCJ is startup speed and modest memory usage. Originally, people claimed that bytecode was more space-efficient than native instruction sets. This is true to some extent, but remember that about half the space in a .class file is taken up by symbolic (non-instruction) information. These symbols are duplicated for each .class file, while ELF executables or libraries can do much more sharing. But where bytecodes really lose out to native code is in terms of memory inside a JVM with a JIT. Starting up Sun's JVM and JIT compiling and applications' classes take a huge amount of time and memory. For example, Sun's IDE Forte for Java (available in the NetBeans open-source version) is huge. Starting up NetBeans takes 74MB (as reported by the top command) before you actually start doing anything. The amount of main memory used by Java applications complicates their deployment. An illustration is JEmacs (JEmacs.sourceforge.net), a (not very active) project of mine to implement Emacs in Java using Swing (and Kawa, discussed below, for Emacs Lisp support). Starting up a simple editor window using Sun's JDK1.3.1 takes 26MB (according to top). XEmacs, in contrast, takes 8MB.

Running the Kawa test suite using GCJ vs. JDK1.3.1, GCJ is about twice as fast, causes about half the page faults (according to the time command) and uses about 25% less memory (according to top). The test suite is a script that starts the Java environment multiple times and runs too many different things for a JIT to help (which penalizes JDK). It also loads Scheme code interactively, so GCJ has to run it using its interpreter (which penalizes GCJ). This experiment is not a real benchmark, but it does indicate that even in its current status you can get improved performance using GCJ. (As always, if you are concerned about performance, run your own benchmark based on your expected job mix.)

GCJ has other advantages, such as debugging with GDB and interfacing with C/C++ (mentioned below). Finally, GCJ is free software, based on the industry-standard GCC, allowing it to be freely modified, ported and distributed.

Some have complained that ahead-of-time compilation loses the big write-once, run-anywhere portability advantage of bytecodes. However, that argument ignores the distinction between distribution and installation. We do not propose native executables as a distribution format, expect perhaps as prebuilt packages (e.g., RPMs) for a particular architecture. You still can use Java bytecodes as a distribution format, even though they don't have any major advantages over Java source code. (Java source code tends to have fewer portability problems than C or C++ source.) We suggest that when you install a Java application, you should compile it to native code if it isn't already so compiled.

Compiling a Java Program with GCJ

Using GCC to run a Java program is familiar to anyone who has used it for C or C++ programs. To compile the Java program MyJavaProg.java, type:

gcj -c -g -O MyJavaProg.java

To link it, use the command:

gcj --main=MyJavaProg -o MyJavaProg MyJavaProg.o
This is just like compiling a C++ program mycxxprog.cc:
g++ -c -g -O mycxxprog.cc
and then linking to create an executable mycxxprog:
g++ -o mycxxprog mycxxprog.o
The only new aspect is the option --main=MyJavaProg. This is needed because it is common to write Java classes containing a main method that can be used for testing or small utilities. Thus, if you link a bunch of Java-compiled classes together, there may be many main methods, and you need to tell the linker which one should be called when the application starts.

You also have the option of compiling a set of Java classes into a shared library (.so file). In fact, the GCJ runtime system is compiled to a .so file. While the details of this belong in another article, if you are curious you can look at the Makefiles of Kawa (discussed below) to see how this works.

Features and Limitations of GCJ

GCJ is not only a compiler. It is intended to be a complete Java environment with features similar to Sun's JDK. If you specify the -C option to gcj it will compile to standard .class files. Specifically, the goal is that gcj -C should be a plugin replacement for Sun's javac command.

GCJ comes with a bytecode interpreter (contributed by Kresten Krab Thorup) and has a fully functional ClassLoader. The standalone gij program works as a plugin replacement for Sun's java command.

GCJ works with libgcj, which is included in GCC 3.0. This runtime library includes the core runtime support, Hans Boehm's well-regarded conservative garbage collector, the bytecode interpreter and a large library of classes. For legal and technical reasons, GCJ cannot ship Sun's class library, so it has its own. The GNU Classpath Project now uses the same license and FSF copyright that libgcj and libstdc++ use, and classes are being merged between the two projects. We use the GPL but with the special exception that if you link libgcj with other files to produce an executable, this does not by itself cause the executable to be compiled by the GPL. Thus, even proprietary programs can be linked with the standard C++ or Java; runtime libraries.

The libgcj library includes most of the standard Java classes needed to run non-GUI applications, including all or most of the classes in the java.lang, java.io, java.util, java.net, java.security, java.sql and java.math packages. The major missing components are classes for doing graphics using AWT or Swing. Most of the higher-level AWT classes are implemented, but the lower-level peer classes are not complete enough to be useful. Volunteers are needed to help out.

Interoperating with C/C++

Although you can do a lot using Java, sometimes you want to call libraries written in another language. This could be because you need to access low-level code that cannot be written to Java, an existing library provides functionality that you need and don't want to rewrite, or you need to do low-level performance hacks for speed. You can do all of these by declaring some Java methods to be native and, instead of writing a method body, provide an implementation in some other language. In 1997, Sun released the Java Native Interface (JNI), which is a standard for writing native methods in either C or C++. The main goal of JNI is portability in the sense that native methods written for one Java implementation should work with another Java implementation, without recompiling. This was designed for a closed-source, distributed-binaries world and is less valuable in a free-software context, especially because you do have to recompile if you change chips or the OS type.

To ensure JNI's portability, everything is done indirectly using a table of functions. This makes JNI very slow. Even worse, writing all these functions and following all the rules is tedious and error-prone. Although GCJ does support JNI, it also provides an alternative. The Compiled Native Interface (CNI, which could also stand for Cygnus Native Interface) is based on the idea that Java is basically a subset of C++ and that GCC uses the same calling convention for C++ and Java. So, what could be more natural than being able to write Java methods using C++ and using standard C++ syntax for access to Java fields and calls to Java methods? Because they use the same calling conventions and data layout, no conversion or magic glue is needed between C++ and Java.

Examples of CNI and JNI will have to wait for a future article. The GCJ manual (gcc.gnu.org/onlinedocs/gcj) covers CNI fairly well, and the libgcj sources include many examples.

Kawa: Compiling Scheme to Native via Java

Java bytecodes are a fairly direct encoding of Java programs not really designed for anything else. However, they have been used to encode programs written in other languages. See grunge.cs.tu-berlin.de/~tolk/vmlanguages.html for a list of other programming languages implemented on top of Java. Most of these are interpreters, but a few actually compile to bytecode. The former could use GCJ as is; the latter potentially can use GCJ to compile to native code.

One such compiler is Kawa, which I have been developing since 1996. Kawa is both a toolkit for implementing languages using Java and an implementation of the Scheme programming language. You can build and run Kawa using GCJ without needing any non-free software. The Kawa home page (www.gnu.org/software/kawa) has instructions for downloading and building Kawa with GCJ.

You can use Kawa in interactive mode. Here, we first define the factorial function and then call it:

$ kawa
#|kawa:1|# (define (factorial x)
#|(---:2|#  (if (< x 2) x (* x (factorial (- x 1)))))
#|kawa:3|# (factorial 30)
265252859812191058636308480000000

An interesting thing to note is the factorial function actually gets compiled by Kawa to bytecode and is immediately loaded as a new class. This process uses Java's ClassLoader mechanism to define a new class at runtime for a byte array containing the bytecodes for the class. The methods of the new class are interpreted by GCJ's bytecode interpreter.

Of course, it is usually more convenient to put the code in a file:

$ cat > factorial.scm
(define (factorial x)
(if (< x 2) x (* x (factorial (- x 1)))))
(format #t "Factorial ~d is ~d.~%~!" 30 (factorial 30))
^D
$ kawa -f factorial.scm
Factorial 30 is 265252859812191058636308480000000.

You can increase the performance of Scheme code by using Kawa to compile it ahead of time, creating one or more .class files:

$ kawa --main -C factorial.scm
(compiling factorial.scm)
You can then load the compiled file:
$ kawa -f factorial.class
Factorial 30 is 265252859812191058636308480000000.
To compile the class file to native code, you can use gckawa, a script that sets up appropriate environment variables (LD_LIBRARY_PATH and CLASSPATH) and calls gcj:
$ gckawa -o factorial
--main=factorial -g -O factorial*.class
Using the wildcard in factorial*.class is not needed in this case, but it is a good idea in case Kawa needs to generate multiple .class files.

Then, you can execute the resulting factorial program, which is a normal GNU/Linux ELF executable. It links with the shared libraries libgcj.so (the GCJ runtime library) and libkawa.so (the Kawa runtime library).

The same approach can be used for other languages. For example, I am currently working on implementing XQuery, W3C's new XML-query language, using Kawa.

Other applications that have been built with GCJ include Apache modules, GNU-Paperclips and Jigsaw.

Conclusion

GCJ has seen a lot of activity recently and is a solid platform for many tasks. We hope that you consider Java for your free software project, using GCJ as your preferred Java implementation and that some of you will help make GCJ even better.

email: per@bothner.com

Per Bothner (www.bothner.com/per) has worked on GNU software since the 1980s. At Cygnus he was technical leader of the GCJ Project. He is currently an independent consultant.