Kernel Mode Linux

Toshiyuki Maeda

Issue #109, May 2003

Now you don't have to write a module to run a program in kernel space. Run any program there with this patch.

Kernel Mode Linux (KML) is a technology that enables the execution of ordinary user-space programs inside kernel space. This article presents the background, an approach and an implementation of KML. A brief performance experiment also is presented.

Traditional kernels protect themselves by using the hardware facilities of CPUs. For example, the Linux kernel protects itself by using a CPU's privilege-level facility and memory protection facility. The kernel assigns itself the most-privileged level, kernel mode. User processes are at the least-privileged level, user mode. Thus, the kernel is protected by CPUs, because programs executed in user mode cannot access memory that belongs to programs executed in kernel mode.

This protection-by-hardware approach, however, has a problem: user processes cannot access the kernel completely. That is, the kernel cannot provide any useful services, such as filesystems, network communication and process management, to user processes. In short, user processes cannot invoke system calls in the kernel.

To cope with this problem, traditional kernels exploit hardware facilities that modern CPUs provide for, escalating a program's privilege level in a safe and restricted way. For example, the Linux kernel for the IA-32 platform uses a software interrupt mechanism inherent to IA-32. The software interrupt can be seen as a special jump instruction whose target address is restricted by the kernel. At initialization, the kernel sets the target address of the software interrupt to the address of a special routine that handles system calls. To invoke system calls, a user program executes a special instruction, int 0x80. Then, the system-call handling routine in the kernel is executed in kernel mode. The routine performs a context switch; that is, it saves the content of the registers of the user program. Finally, it calls the kernel function that implements the system service specified by the user program.

The system call-by-hardware approach can become very slow, however, because the software interrupt and the context switch require heavy and complex operations. On the recent Pentium 4, the software interrupt and context switch is about 132 times slower than a mere function call.

By the way, recent Linux kernels for IA-32, versions 2.5.53 and later, use a pair of special instructions, sysenter and sysexit, for system calls. But, this is still about 36 times slower than a mere function call.

The obvious way to accelerate system calls is to execute user processes in kernel mode. Then, system calls are handled quickly because no software interrupts and context switches are needed. They can be function calls only, because the user processes can access the kernel directly. This approach may seem to have a security problem, because the user processes executed in kernel mode can access arbitrary portions of the kernel. Recent advances in static program analysis, such as type theory, can be used to protect the kernel from user processes. Many technologies enable this protection-by-software approach, including Java bytecode, .NET CIL, O'Caml, Typed Assembly Language and Proof-Carrying Code.

KML: Execute User Processes in Kernel Mode

As a first step toward a kernel protected by software, I have implemented KML. KML is a modified Linux kernel that executes user processes in kernel mode, which then are called kernel-mode user processes. Kernel-mode user processes can interact with the kernel directly. Therefore, the overhead of system calls can be eliminated.

KML is provided as a patch to the source of the original Linux kernel, so you need to build the kernel from the source. To use KML, apply the patch and enable Kernel Mode Linux when you configure your kernel. Build and install the kernel, and then reboot. The KML patch is available from www.yl.is.s.u-tokyo.ac.jp/~tosh/kml.

In current KML, programs under the directory /trusted are run as kernel-mode user processes. The kernel itself doesn't perform any safety check. For example, the following commands:

% cp /bin/bash /trusted/bin && /trusted/bin/bash

execute bash in kernel mode.

What Kernel-Mode User Processes Can Do

Kernel-mode user processes are ordinary user processes except, of course, for their privilege level. Therefore, they basically can do whatever an ordinary user process can do. For example, a kernel-mode user process can invoke all system calls, even fork, clone and mmap. In addition, if you use a recent GNU C library (2.3.2 and later or the development version from CVS), system calls are translated automatically to function calls in kernel-mode user processes, with a few exceptions, such as clone. Therefore, the overhead of system calls in your program is removed without modifying it.

The paging mechanism also works. That is, kernel-mode user processes each have their own address space, the same as ordinary user processes. Moreover, even if the kernel-mode user process excessively allocates huge memory, the kernel automatically pages out the memory, as it does for ordinary user processes.

Exceptions, such as segmentation faults and illegal instruction exceptions, can be handled the same as an ordinary user process, unless the program improperly accesses the memory of the kernel or improperly executes privileged instructions. As an example, build the following program and execute it as a kernel-mode process:

int main(int argc, char* argv[])
{
    *(int*)0 = 1;
    return 0;
}

The process is terminated by a segmentation fault exception, without a kernel panic. This example also indicates that the signal mechanism works.

As a second example, build the following program and execute it as a kernel-mode user process:

int main(int argc, char* argv[])
{
    for (;;);
    return 0;
}

Then, use Ctrl-C to send SIGINT to the process. Notice that it receives the signal and exits normally.

This second example also indicates that process scheduling works. That is, even if a kernel-mode user process enters an infinite loop, the kernel preempts the process and executes other processes. You may have noticed already that your system did not hang, even in the infinite loop of this example.

What Kernel-Mode User Processes Cannot Do

Although kernel-mode user processes are ordinary user processes, they have a few limitations. If a kernel-mode user process violates these limitations, the system will be in an undefined state. In the worst-case scenario, your system may be broken.

Limitation 1: don't modify the CS, DS, SS or FS segment register. Current KML for IA-32 assumes that these segment registers are not modified by kernel-mode user processes, and it uses them internally.

Limitation 2: don't perform privileged actions improperly. In kernel mode, programs can perform any privileged action. However, if your program performs such actions in a way that is inconsistent with the kernel, the system will be in an undefined state. For example, if you execute the following program as a kernel-mode user process:

int main(int argc, char* argv[])
{
        /* disable hardware interrupts */
        __asm__ __volatile__ ("cli");
        for (;;);
        return 0;
}

your system will hang.

In my experience, few applications violate these limitations. Ones that do violate them include WINE and VMware. These limitations are against only kernel-mode user processes. Ordinary user processes are never affected by these limitations, even when running on a KML-capable kernel.

KML Internals

In IA-32 CPUs, the privilege level of an executed program is determined by the privilege level of the code segment in which the program is executed. Recall that a program counter for IA-32 CPUs consists of a segment, specified by the CS segment register, and an offset into the segment, the EIP register. The privilege level of the code segment then is determined by its segment descriptor. A segment descriptor has a field for specifying the privilege level of the segment.

Basically, the Linux kernel prepares two segments, the kernel code segment and the user code segment. The kernel code segment is used for the kernel itself, and its privilege level is kernel mode. The user code segment is used for ordinary user processes, and its privilege level is user mode. When using execve on a user process, the original Linux kernel sets its CS segment register to the user code segment. Thus, the user process is executed in user mode.

To execute a user process as a kernel-mode user process, the only thing KML does is set the CS register of the process to the kernel code segment, instead of to the user code segment. Then the process is executed in kernel mode. Because of KML's simple approach, a kernel-mode user process can be an ordinary user process.

The Stack Starvation Problem and Its Solution

As described in the previous section, the basic approach of KML is quite simple. Its big problem is called stack starvation. First, I'll explain how the original Linux kernel handles exceptions (page faults) and interrupts (timer interrupts) on IA-32 CPUs. Then, I'll describe the stack starvation problem. Finally, I'll present my solution to the problem.

In the original Linux kernel, interrupts are handled by interrupt handling routines specified as gates in the Interrupt Descriptor Table (IDT). When an interrupt occurs, an IA-32 CPU stops execution of the running program, saves the execution context of the program and executes the interrupt handling routine.

How the IA-32 CPU saves the execution context of a running program at interrupts depends on the privilege level of the program. If the program is executed in user mode, the IA-32 CPU automatically switches its memory stack to a kernel stack. Then, it saves the execution context (EIP, CS, EFLAGS, ESP and SS register) to the kernel stack. On the other hand, if the program is executed in kernel mode, the IA-32 CPU doesn't switch its memory stack and saves the context (EIP, CS and EFLAGS register) to the memory stack of the running program.

What happens if a kernel-mode user process of KML accesses its memory stack, which is not mapped by the page tables of a CPU? First, a page fault occurs, and the CPU tries to interrupt the process and jump to a page fault handler specified in the IDT. However, the CPU can't accomplish this work, because there is no stack for saving the execution context. Because the process is executed in kernel mode, the CPU can never switch the memory stack to the kernel stack. To signal this fatal situation, the CPU tries to generate a special exception, a double fault. Again, the CPU can't generate the double fault, because there is no stack for saving the execution context of the running process. Finally, the CPU gives up and resets itself.

To solve this stack starvation problem, KML exploits the task management facility of IA-32 CPUs. The IA-32 task management facility is provided to support process management for kernels. Using the facility, a kernel can switch between processes with only one instruction. However, today's kernels don't use this facility, because it is slower than software-only approaches. Thus, the facility is almost forgotten by all.

The strength of this task management facility in IA-32 CPUs is that it can be used to handle interrupts and exceptions. Tasks managed by an IA-32 CPU can be set to the IDT. If an interrupt occurs and a task is assigned to handle the interrupt, the CPU first saves the execution context of the interrupted program to a task data structure of the program instead of to the memory stacks. Then, the CPU restores the context from the task data structure specified in the IDT.

The most important point is there is no need to switch a memory stack if the task management facility is used to handle interrupts. That is, if we handle page fault exceptions with the facility, a kernel-mode user process can access its memory stack safely.

However, if we handle all page faults with the facility, the performance of the whole system degrades, because the task-based interrupt handling is slower than the ordinary interrupt handling.

Therefore, we handle only double fault exceptions this way. So, only page faults caused by memory stack absence are handled by the task management facility. In my experience, memory stacks rarely cause page faults, and the performance decrement is negligible.

Performance Measurement

To measure the degree of performance improvement, I conducted two experiments. Both experiments compared performance of the original Linux kernel and KML. I used the sysenter/sysexit mechanism for performance measurement of the original Linux kernel, instead of the int 0x80 instruction. The experimental environment is shown in Table 1.

Table 1. Experimental Environment

In the first experiment, I measured the latency of the getpid and gettimeofday system calls. In the measurement, the system calls were invoked directly by user programs, without libc. The latency was measured with the rdtsc instruction. The result is shown in Table 2.

Table 2. Latency of System Calls (Unit: CPU Cycles)

The result shows that getpid was 36 times faster in KML than in the original Linux kernel, and gettimeofday was twice as fast in KML as it was in the original Linux kernel.

The second experiment is a file I/O benchmark using the Iozone filesystem benchmark. I measured the throughput of four types of file I/O: write, rewrite, read and reread. The measurements were performed on various file sizes from 16KB to 256KB, and the buffer size was fixed at 8KB. The underlying filesystem was ext3. In each measurement, I executed the Iozone benchmark 30 times and chose the best throughput.

The throughput of reread is shown in Table 3. Due to space limitations, the detailed results for write, rewrite and read are omitted.

Table 3. Throughput of reread: Buffer Size=8KB

The result shows that the throughput of reread in KML was improved by 6.8-21%. In addition, write was improved by 0.6-3.2%, rewrite was improved by 3.3-5.3% and read was improved by 3.1-15%.

These experimental results indicate that KML can improve the performance of applications that invoke system calls often, such as those that read or write many small files. For example, web servers and databases can be executed efficiently in KML.

I've performed a benchmark for the Apache HTTP server on KML. It didn't show performance improvement, because I have only a 100-base Ethernet LAN, which became the main bottleneck. If I perform the benchmark on a faster network (say, 1000-base Ethernet or faster), I predict it will show performance improvement.

In the preceeding experiments, it is worth noting that KML eliminated only the overhead of system calls. With some modification to the application, KML will be able to do more for performance improvement. For example, kernel-mode user processes can access I/O buffers directly in the kernel to improve I/O performance.

Resources

email: tosh@is.s.u-tokyo.ac.jp

Toshiyuki Maeda is a PhD candidate in Computer Science at the University of Tokyo. His favorite comics are Hikaru no GO (Hikaru's Go), Jojo no Kimyo na Boken (Jojo's Bizarre Adventure) and Runatikku Zatsugidan (Lunatic Acrobatic Troupe).