Improving Server Performance

Chen Chen

David Griego

Issue #93, January 2002

How to improve your server's performance by offloading a TCP/IP stack from a Linux-based server onto an iNIC.

Looking to improve the performance of your high-end Linux server? Is your Linux system connected to a high-speed network? Are your servers spending too much of their resources processing the TCP/IP stack and Ethernet frames? These are just some of the problems in today's network environment. They arise because TCP/IP traffic on the Internet and on private enterprise networks has been growing dramatically for the past decade, and there is no sign that this growth will slow down any time soon. The widespread global adoption of the Internet and the development of new network storage technologies such as iSCSI are driving networking speeds even higher. Although the processors being manufactured today are gaining speed at an astonishing pace, it is likely that network-related growth will continue to outpace increasing processor speeds, diverting servers from their primary tasks in order to process network packets.

Intel has developed a prototype that offloads an entire TCP/IP stack from a Linux-based operating system onto an intelligent network interface card (iNIC).

The iNIC Design

The iNIC contains a real-time operating system (RTOS) and an entire TCP/IP version 4 stack. An I/O processor (IOP) on the iNIC processes all of the network packets, freeing the host processor for other tasks. To accomplish this division of labor, a thin layer of logic is needed on the host side to route all network traffic through the iNIC.

Socket Offload

This technology is based on the intelligent I/O (I2O) architecture, which is already incorporated into the Linux 2.4.x kernel. Figure 1 serves as an I2O primer for those who are not familiar with it. The I2O specification defines a message-based communication mechanism between a host operating system (OS) and I/O devices attached to an IOP. The IOP runs an intelligent RTOS (IRTOS), which contains device driver modules (DDMs) for each connected device. To achieve portability, I2O uses a split-driver model: the specification defines a set of abstract messages for each supported device class (e.g., LAN, tape, disk), and the host OS uses this abstracted message layer to communicate with the DDMs running on an IOP. Each DDM translates these I2O messages into hardware-specific commands. To communicate with an I2O device, the host OS must have a driver that knows how to translate OS device commands into I2O device-class commands. This module in the host OS is referred to as the operating system module (OSM).

Figure 1. I2O Architecture
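As a concrete (if simplified) picture of what travels over that message layer, the structure below paraphrases the standard I2O message frame header. The field names and packing follow the spirit of the I2O specification and the kernel's i2o headers rather than reproducing them exactly.

/*
 * Simplified view of an I2O message frame header (paraphrased; not the
 * exact kernel definition).
 */
#include <linux/types.h>

struct i2o_msg_header {
	__u8	version_offset;		/* I2O version and offset to the SGL */
	__u8	flags;			/* e.g., reply requested, fail bit */
	__u16	size;			/* message size in 32-bit words */
	__u32	target_tid:12,		/* target (DDM) ID */
		initiator_tid:12,	/* initiator (OSM) ID */
		function:8;		/* function code for the request */
	__u32	initiator_context;	/* opaque value echoed in the reply */
	__u32	transaction_context;	/* per-request cookie */
	/* function-specific payload and scatter-gather list follow */
};

Every request the host posts and every reply it receives begins with such a header; the function code identifies the operation, and the context fields let replies be matched to outstanding requests.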

Our solution is based on an extension of this architecture: the creation of a socket class. Changes were made to the I2O architecture to increase performance and minimize the latencies usually associated with a split-driver model.

Figure 2. Socket Offload Architecture

The socket class defines the messages needed for communication between the host OS and the DDM. The two drivers, OSM and DDM, communicate over a two-layer communication system: the message layer sets up the communication session, and the transport layer defines how information is shared. The DDM is composed of two modules: an intermediate service module (ISM) and a hardware device module (HDM). The ISM provides the full functionality of the TCP/IP version 4 stack, and the HDM is the device driver for the iNIC.

The socket OSM is unlike any other network driver in Linux. Normal network card drivers are protocol-independent and interface with the Linux kernel at the network application program interface (API). The socket OSM, on the other hand, interfaces directly below the socket API. This allows the necessary socket services to be offloaded onto the IOP running the socket-class ISM. The socket OSM replaces the services that the TCP/IP stack provided to the kernel, supplying the interfaces the kernel requires. It also converts socket requests and data into the socket offload format and transmits them to the iNIC running the TCP/IP offload stack.

Socket OSM

The OSM is divided into the following subsystems: user interface, message interface, kernel interface and memory management.

The user interface replaces the af_inet socket layer in the kernel. It responds to user programs exactly as the native (non-offloaded) kernel would.
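To make this concrete, here is a minimal sketch of how a module could take over the PF_INET family on a 2.4 kernel. The prototype's actual source is more involved; osm_create() and the other names are hypothetical, and error handling is omitted.

/* Minimal sketch: replacing the native af_inet family with an offload
 * front end.  osm_create() would allocate OSM-side socket state instead
 * of a native INET sock.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/net.h>

static int osm_create(struct socket *sock, int protocol)
{
	/* allocate OSM state, install the offload proto_ops, etc. */
	return 0;
}

static struct net_proto_family osm_family = {
	family:	PF_INET,
	create:	osm_create,
};

static int __init osm_init(void)
{
	sock_unregister(PF_INET);		/* detach the native INET layer */
	return sock_register(&osm_family);	/* take over PF_INET sockets */
}

static void __exit osm_exit(void)
{
	sock_unregister(PF_INET);
}

module_init(osm_init);
module_exit(osm_exit);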

The message interface provides the initialization and control of the socket offload system. It translates user socket requests into socket messages.
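The message formats themselves are defined by the socket class and are not reproduced here; the structure below is only a hypothetical illustration of what a connect() request might carry once translated, building on the I2O header sketch shown earlier.

/* Hypothetical socket-class message for a connect() request; the real
 * socket class defines its own function codes and layouts.
 */
#include <linux/types.h>

struct sock_osm_connect_msg {
	struct i2o_msg_header	hdr;		/* standard I2O frame header */
	__u32			sock_handle;	/* iNIC-side socket identifier */
	__u32			addr;		/* destination IPv4 address */
	__u16			port;		/* destination TCP port */
	__u16			reserved;
};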

The kernel interface provides kernel services to the OSM. This is the point at which the OSM provides any services normally provided to the kernel by the TCP/IP stack. This subsystem was designed to minimize the modifications needed for the Linux kernel.

The memory management module provides the buffer pools needed for data transfer to and from the user-space applications. Memory management was designed to 1) minimize the number of data copies and DMA requests, 2) minimize the host interrupts, 3) avoid requiring costly physical-virtual address mappings and 4) avoid overhead of dynamic-memory allocation at runtime. Two pools of DMA-capable data buffers are maintained in the OSM. The transmit buffer contains the data headed for the iNIC; the receive buffer receives the data into the kernel from the socket device.
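A minimal sketch of how such pools might be preallocated on a 2.4 kernel follows. pci_alloc_consistent() is the standard 2.4 interface for DMA-coherent memory; the pool size, buffer size and the osm_buf_pool structure are illustrative rather than the prototype's.

/* Illustrative buffer-pool setup: allocate DMA-capable buffers once at
 * load time so the data path needs no dynamic allocation and no
 * virtual-to-physical mapping.  Names and sizes are hypothetical.
 */
#include <linux/errno.h>
#include <linux/pci.h>

#define OSM_POOL_BUFS	256
#define OSM_BUF_SIZE	4096

struct osm_buf_pool {
	void		*cpu_addr[OSM_POOL_BUFS];	/* kernel virtual addresses */
	dma_addr_t	bus_addr[OSM_POOL_BUFS];	/* bus addresses handed to the iNIC */
};

static int osm_pool_init(struct pci_dev *pdev, struct osm_buf_pool *pool)
{
	int i;

	for (i = 0; i < OSM_POOL_BUFS; i++) {
		pool->cpu_addr[i] = pci_alloc_consistent(pdev, OSM_BUF_SIZE,
							 &pool->bus_addr[i]);
		if (!pool->cpu_addr[i])
			return -ENOMEM;		/* caller unwinds the pool */
	}
	return 0;
}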

Figure 3. Linux TCP/IP Stack

As shown in Figure 3, Linux network components consist of a layered structure. User-space programmers access network services via sockets, using the functions provided by the Linux socket layer. The socket structure defined in include/linux/net.h forms the basis for the implementation of the socket interface. Below the user layer is the INET socket layer. It manages the communication end points for the IP-based protocols, such as TCP and UDP. This layer is represented by the data structure sock defined in include/net/sock.h. The layer underneath the INET socket layer is determined by the type of socket and may be the UDP or TCP layer or the IP layer directly. Below the IP layer are the network devices, which receive the final packets from the IP layer.
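Abridged and slightly paraphrased from those headers (several fields are omitted), the key relationship looks like this:

/* Abridged view of struct socket from include/linux/net.h (2.4). */
struct socket {
	socket_state		state;	/* unconnected, connected, ... */
	struct proto_ops	*ops;	/* per-family operations (INET, UNIX, ...) */
	struct file		*file;	/* back pointer to the VFS file */
	struct sock		*sk;	/* protocol state, include/net/sock.h */
	short			type;	/* SOCK_STREAM, SOCK_DGRAM, ... */
};

Every request on an open socket is dispatched through the ops table, and sk holds the INET-layer state; these are exactly the hooks the socket OSM takes over.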

The socket OSM replaces the INET socket layer. All socket-related requests passed from the socket layer are converted into I2O messages, which are passed to the ISM on the IOP.

Embedded Target

The embedded system software consists of the messaging layer, TCP/IP stack, device driver and RTOS.

The messaging layer is the portion of the software that takes messages from the OSM, parses them and makes the corresponding socket calls into the TCP/IP stack. This layer also takes replies from the network stack and sends the appropriate reply to the OSM. To improve performance and minimize the effects of the latency inherent in split-driver systems, the messaging layer batches replies and pipelines requests.
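In outline, and with purely illustrative names (the prototype's messaging layer is not reproduced here), the loop on the IOP looks something like this:

/* Illustrative IOP-side dispatch loop: pull requests from the inbound
 * queue, call into the embedded TCP/IP stack, and batch the replies so
 * the host is notified once per group rather than once per call.
 */
#define REPLY_BATCH	8

struct sock_msg;					/* socket-class message */
extern struct sock_msg *inbound_queue_get(void);	/* hypothetical queue primitives */
extern int inbound_queue_empty(void);
extern struct sock_msg *stack_dispatch(struct sock_msg *req); /* connect, send, recv, ... */
extern void post_replies(struct sock_msg **replies, int n);

static void msg_layer_loop(void)
{
	struct sock_msg *replies[REPLY_BATCH];
	int n = 0;

	for (;;) {
		struct sock_msg *req = inbound_queue_get();

		replies[n++] = stack_dispatch(req);
		if (n == REPLY_BATCH || inbound_queue_empty()) {
			post_replies(replies, n);	/* one host notification for n replies */
			n = 0;
		}
	}
}

Batching the replies in this way means the host takes one interrupt per group of completed operations rather than one per socket call, which is one way the latency cost of a split-driver model can be recovered.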

The embedded TCP/IP stack is a zero-copy implementation of the BSD 4.2 stack. It provides all of the functionality of a networking stack to the messaging layer. Like all the software that runs on the IOP, the stack has been optimized for the Intel 80310 I/O processor chipset with Intel XScale microarchitecture (ARM-architecture compliant). Benchmarks were performed on the TCP/IP stack during optimization to ensure that it would perform well across all sizes of data traffic.

The HDM was written to take advantage of all the offloading capabilities of the NIC hardware. This includes TCP and IP checksums on transmit and receive, segmentation of large TCP packets (larger than 1,500 bytes) and interrupt batching supported by the chip. The NIC silicon chips supported were the Intel 82550 Fast Ethernet Multifunction PCI/CardBus Controller and the Intel 82543GC Gigabit Ethernet Controller.

The RTOS is a proprietary OS that has been designed for the demands of complex I/O operations. This OS is fully I2O-compliant. It was chosen in part because of the willingness of the designers to make modifications to the OS for the prototyping efforts.

As described before, the socket calls made by the application layer are converted into messages that are sent across the PCI bus and to the I/O processor. This embedded system is a complete computer for performing I/O transactions. It consists of a processor, memory, RTOS and a PCI bus. Because it is designed for I/O, it will minimize the effects of context switching. Once a message reaches the IOP, it is parsed. The socket call that was requested by the application is then called on the embedded network stack. A reply message is sent to the OSM once the socket operation is completed.

Benchmark Results

The benchmark tests that were run using the prototype showed that offloading the TCP/IP stack significantly reduced both CPU utilization and the number of interrupts delivered to the host processor. On a heavily loaded machine, the offloaded stack maintained overall network performance, and host CPU cycles remained dedicated to the workload applications. On a native machine, the host processors were interrupted far more frequently, and the network application suffered from CPU resource starvation, resulting in degraded network performance.

Figure 4. Benchmark Results

Future Direction

As the subject of iSCSI (storage over IP by encapsulating SCSI in a TCP/IP packet) starts to heat up, the desire to minimize network overhead will continue to grow. The effort that went into moving the TCP/IP stack to an IOP could quickly shift toward providing a full-featured TCP/IP stack at the back end of an intelligent iSCSI adaptor. This would minimize the impact of iSCSI on a Linux platform by making it available via the normal SCSI API. To compete with Fibre Channel, iSCSI must provide comparable performance.

Another future enhancement is to use embedded Linux as the RTOS. At the start of this prototyping effort, an Intel i960 RM/RN processor was used, and embedded Linux was not available. Since then, the Intel XScale microarchitecture has been introduced, enabling the adoption of the embedded Linux available for the Intel StrongARM core. Porting of StrongARM Linux to the XScale microarchitecture will be completed by the end of the year.

There were several goals behind this prototype effort: 1) to demonstrate that offloading network tasks from the host processor frees the host processor cycles otherwise consumed by processing network data, 2) to show that specialized software on the iNIC can perform the same networking tasks while maintaining overall network performance and 3) to enable I/O processors to work in conjunction with host processors to handle network traffic, thereby maximizing the performance of a Linux-based server at minimal cost.

Offloading the TCP/IP protocol to a specialized networking software environment using embedded processors is an effective way of improving system performance. With the advancement of high-speed network deployments and adoption of network storage, TCP/IP will inevitably play an important role.

Acknowledgements

Technical contributors: Dave Jiang, Dan Thompson, Jeff Curry, Sharon Baartmans, Don Harbin and Scott Goble.

Chen Chen is doing research on advanced I/O applications at Intel. He can be reached at chen.chen@intel.com.

David Griego has been a Linux enthusiast for more than three years. He works at Intel in Advance Development Engineering. He can be reached at david.a.griego@intel.com.