Linux on Track

Harald Kirsch

Issue #49, May 1998

Linux was used in two projects as a data acquisition system running more or less autonomously in the German ICE trains. This article describes design issues and implementation as well as the problems and solutions used in those projects.

The Fraunhofer Gesellschaft is a non-profit research organization specializing in applied research and acting as a proponent of technology transfer between basic research and industry. With nearly 50 institutes in Germany, almost all aspects of science are covered. Part of the IITB (Institut für Informations- und Datenverarbeitung) specializes in applications of computer-based monitoring, control and diagnosis of industrial processes and equipment by means of signal and image analysis. As such, our group was involved in two projects requiring data acquisition and analysis in one of the German high speed trains, the InterCity Express (ICE). This article describes how the data acquisition was implemented using Linux.

The Projects: UNRA and ICE-D

Project UNRA (unrunde Räder) tried to discover the reasons wheels of high speed trains become out-of-round sooner than expected. As railway experts know, train and coach wheels become out-of-round, i.e., the difference between the minimum and the maximum radius of a wheel becomes non-negligible. For ICE wheels, DB-AG uses 0.6 mm as the threshold above which wheels must be changed. Despite being only 0.1% of the wheel radius, such differences induce low-frequency vibrations into the coach structure. They not only put additional stress on structural materials but also cause an unpleasant ride for passengers.

ICE High Speed Train

As part of this project, a measuring axle developed and patented by Forschungs- und Technologiezentrum der DB-AG (Minden, Germany) was installed in a regular running ICE to measure triaxial forces in the stand-up point of the wheel. In addition, four accelerometers were installed, three on the axes at the bearing and one in the coach measuring vertical acceleration. In addition, a resolver was attached to the axle. It delivers 1440 TTL-edges for one turn of the wheel. The edges are used to clock 90 A/D conversion for one turn of the wheel, resulting in a sampling rate dependent on the wheel's rotation speed but synchronous with its rotation.

With additional channels not mentioned above, a total of 17 channels had to be measured. Since the radius of an ICE-wheel is 460 mm, 1 km amounts to:

samples, resulting in:

of data every day.

The second project, called ICE-D, did not have such a great demand on bandwidth, but it required some data analysis to be computed on-line. Since the system is quite similar to the one used in the UNRA project, this project will not be described in such detail. Suffice it to say that another Linux computer will run for two years in a coach car with a new type of bogie (the assembly of four wheels on which the coach rests) to acquire and analyse 50 data channels.

Hardware

The computer hardware used in both projects is not identical but not very exciting in either case. In UNRA we use a garden variety Pentium 90 system with Adaptec 2940 SCSI controller, a sufficiently large Quantum disk, which seems to have settled its disagreements with the SCSI controller, and an HP DAT streamer. For the ICE-D project very similar hardware is used. Comparatively expensive were the 19-inch cases which were necessary to mount the computer in a rack in the train.

Of more interest might be the measurement hardware. In UNRA there is an Analog Devices RTI-860, which is a 16-channel, 12-bit A/D (analog/digital) conversion board. What makes it particularly suited for use with Linux is its on board memory of 256K samples, relieving us from hard real-time constraints.

Another board, called ADCO, was developed along the specifications of IITB by CMS. It is basically a device to measure times with a 40 MHz clock driving a 16-bit counter. On external events, the counter is read into a FIFO of one kilometer word length and then reset to start again from zero. Consequently the FIFO collects counter values representing a sequence:

of times between events. The events are generated by the resolver mounted on the axle and happen once for every four degrees of wheel rotation. Knowing how much time the wheel needs to rotate four degrees, we can calculate the rotational speed of the wheel with high precision. The one kilometer sample FIFO on the ADCO is small compared to the buffer on the RTI-860 and actually proved to be too small for the first driver developed (see below).

Project ICE-D required another kind of measurement hardware because the task was not to acquire data at high speed but to measure up to 64 channels. We decided on an RTI-834 from Analog Devices because it can measure 32 channels and is probably the only board around with a useful programmer's manual. The manual is not free (approximately 250 DM) but, believe it or not, it contains almost all details about programming the hardware, including examples that made it comparatively easy to write a device driver.

In addition to these devices, a GPS (Global Positioning System) device was mounted on the coach. Position information was recorded with the data to be able to correlate it later with actual locations on the track.

Software

The software comprises several main components: the data acquisition program, a watchdog and a taper, a GPS monitor and device drivers. These components are described in the following subsections.

Data Acquisition

Except for the time between two and four o'clock in the morning, the data acquisition is active and digitizes data. Because the data acquisition is synchronous with wheel rotation, the data rate depends on the train's speed. At 300kph the wheel rotates at about 29Hz resulting in a data rate of

which has to be streamed to disk without loss.

Every 345 rotations of the wheel (about one kilometer), the hardware is reset to trigger again on the next zero-degree marker of the resolver. At that time the data acquisition program fetches the most recent information from the GPS, writes it to file, closes the file and opens a fresh one. Each file covers one kilometer of track and is nearly one megabyte in size. This approach was chosen for several reasons:

  1. One megabyte and one kilometer are convenient sizes to handle with data analysis software.

  2. Synchronising every kilometer makes sure that losing individual events from the resolver due to noise will not spoil all data for the rest of the day.

  3. One kilometer was determined to be a useful checkpoint to record GPS information.

  4. The files are not created anew each day but are overwritten for efficiency reasons. In case of a power failure, it is almost impossible to find out how much of a file is new and how much is from the day before; therefore, a partly written file has to be thrown away. Throwing away up to one kilometer of data is a reasonable tradeoff between number of files and amount of data lost.

Of course, there is nothing magic about one kilometer. Two kilometers or one half kilometer would probably have worked equally well.

While reading data from the devices, the data acquisition program also monitors the wheel's rotational speed to check whether the train's speed is above 60kph. Below that threshold, data is considered to be of no interest and is thrown away. In particular the file currently being written is reset and reused as soon as the speed rises above the threshold. Of course, up to one kilometer worth of data recorded at speeds above 60kph is discarded, but in fact, the threshold of 60kph is a rough guess anyway so no harm is done by discarding some data recorded at speeds slightly above 60kph. Typical travelling speeds of the ICE are, depending on track type, 100kph, 160kph, 250kph and 280kph, and only those speeds were of major interest in the project.

The data acquisition program is rather simple, most of it doing error handling in case of read or write errors. Since device drivers were implemented for the RTI-860 as well as for the ADCO, digitizing is as easy as opening a file and reading from it. The only thing requiring even minimal thought was that the data rate from the two drivers is not identical. Reading the same amount of data from both devices in every course through the main loop would soon fill up one of the driver's buffers. A general solution in such cases is the use of the select() system call; however, in the given case, the exact ratio between the two data rates was known and the amount of data read from each driver in every read-call was chosen accordingly.

cron Jobs

At two o'clock in the morning the data acquisition process stops recording data in order not to interfere with other work done at that time. First, a cron job reboots the system as a preventive measure against memory leaks. Although none were observed, rebooting costs nothing and does no harm. After the boot, the acquired data is written to tape with a script started as a cron job which ultimately calls tar.

A minor nuisance was that it is almost impossible to find out how much space is used on the tape if internal compression of the DAT drive is enabled. Assuming that the compression ratio is about the same every day, it would probably have been possible to put two days' worth or 1.5GB of uncompressed data onto a 2GB tape. Since the A/D converter only delivers 12 bits which are stored as 16-bit values, a compression to 1.125GB should be trivial. Another 12% reduction is probably possible because most of the time the digitized signals do not cover the full 12 bits.

During the rest of the day, i.e., not between two and four o'clock in the morning, another cron job is started every ten minutes. As a measure against yet unknown bugs in the data-acquisition program which may cause it to crash, a watchdog program checks if the data-acquisition process is still in the process table, and if it is, assumes that it is doing something useful. If it is not in the table, the system is rebooted. As of this writing the watchdog has still to prove its utility, since no such incident has been found in the log files.

GPS monitor

The GPS device is a cute little gadget that looks like a computer mouse without buttons. It is about the same size, shape and color as a mouse and is mounted on the top of the coach to have a clear view of the sky. It is connected to the computer via a serial line which also delivers power to the device. As soon as the GPS is connected, it starts sending several types of information which can be read with a command as simple as:

cat /dev/ttyS1

as long as /dev/ttyS1 is set to the correct baud rate. By writing to the device, it can be programmed to deliver only certain types of information.

The high speed data acquisition has one minor deficiency—it only delivers a dataset once per second. As described above, the positioning information is entered into every data file one kilometer in size. Now suppose that the data acquisition process starts reading from the GPS after it has acquired the last one kilometer sample. Reading may take up to one second while the wheel turns up to 28 times per second, thereby losing about 80 meters of data.

Since losing data was not considered efficient, the gpsmonitor program was introduced, running parallel to the data acquisition process. It reads the position information at the given rate of one second and stores it in a file where the most recent information is always available for the data acquisition process.

To make sure that the data acquisition process does not read partially written data sets, in general it would be necessary to use a file locking scheme to bar the acquisition process from reading while gpsmonitor is writing its data. However, one data set is only 80 characters in length and is sent to the file in one write-operation. Checking the Linux kernel sources might show that this is not an entirely atomic operation, but experiments with a process rereading the information at the highest possible frequency have shown that the probability that a write of 80 characters would be interrupted by another process is practically zero, i.e., was not observed. Consequently, file locking was considered to be unnecessary overhead.

Device Drivers

The most interesting part of the project for the Linux hacker is certainly the device driver section. As usual, no device driver could be obtained from manufacturers of the boards to be used. It is a pity that no manufacturer of measurement hardware recognizes the potential of Linux as a measurement platform. Certainly not a real-time system, but with today's fast processors and some precautions, Linux is able to stream data to the disk at high rates and without dropping data.

Writing a device driver seemed to be a daunting task, and it proved to be exactly that, but for reasons other than the expected ones. Not being particularly familiar with the internals of Linux, it first seemed that learning the interface between kernel and driver might be a complicated problem. It proved to be almost too trivial to mention. With an early version of the kernel hacker's guide, code of other drivers all around, and the helpful Net community, communication between kernel and driver was easily established.

The bad part was the hardware, mainly due to a lack of decent documentation. German distributors were approached to almost no avail, even for analog devices. Linux was as yet unheard of, and all that could be obtained was source code for MS-DOS and a user's guide for the RTI-860 containing a full schematic diagram. For the ADCO, the situation was just about the same. Nevertheless drivers were written, and the work is almost perfect today. Only the RTI-860 driver still contains a nasty bug, probably due to a timing problem: clearing the on-board memory and enabling the trigger cannot be done in the right order. Independent of which operation is done first, some samples are sometimes dropped, presumably only if the trigger line goes active very shortly after the trigger is enabled.

Another problem is the kernel itself. This problem was observed in 1.x kernels and seems to persist in 2.0.x kernels. Because the ADCO board has only a one kilometer sample FIFO and must be emptied before it overflows, at a 50KHz sampling rate the driver has to read the data out at a rate of 50Hz. Put another way, the driver has to have a look at the board at least every 20ms. With a time slice of 10ms in a typical Linux kernel, this must happen every other jiffy. For those not familiar with kernel code, it should be noted that there is a variable called a jiffy in the kernel, which is incremented by the timer interrupt. In the Linux kernel, a jiffy is defined rather exactly to be 10ms. In particular with the POSIX scheduler available in recent kernels, this should not present a problem. In contrast to the normal Linux scheduler which constantly changes process priorities to distribute processor time in a fair way, the POSIX scheduler allows a fixed priority to be attached to a process. With the right priorities, at least one process can be guaranteed to get the processor at the next scheduling event after it makes a request. This should be at the next tick of the clock, which is at most 10ms away.

In practice it was found that sometimes no scheduling occured for 40, 50 or even 100ms, which was even more irritating as no other process was active at that time. It looked very much like the mechanism responsible for paging and/or swapping was responsible for it, but due to limited resources, the problem could not be further investigated.

As a workaround, a mechanism in the kernel was exploited which allows small pieces of code to run between two jiffies. Although no scheduling was performed for up to 100ms, the timer interrupt was not blocked and ticked along fine every 10ms. One of its tasks is to run code which is registered on a certain queue by other parts of the kernel. By registering a function which reads the ADCO's FIFO into a driver-internal buffer, the problem of missing scheduling events could be circumvented. In fact, it is not even necessary to use the POSIX scheduler.

Conclusion, Open Questions and Lessons Learned

Linux proved to be an absolutely stable platform for software development and autonomous data acquisition. The three finger salute (ctrl-alt-del), well known on certain widespread desktop program launchers, is never necessary on Linux.

Using A/D conversion boards with on-board memory precludes all real-time constraints. Boards with too little memory are not easily supported. The fact that scheduling is sometimes suppressed for more than 100ms is considered a bug and first resulted in some hectic and active kernel debugging in cooperation with Ingo Molnar (Wien). It turned out that there seemed to be more than one reason for the problems, and they were reported to the kernel developers by Mr. Molnar. However, since we could not wait for the problem to be corrected (a simple patch seemed not to be enough), the solution described above was chosen.

Programming feature-rich A/D conversion boards proved to be more complicated than expected. Even the driver for the well-documented RTI-834 was not easy because of the many dependencies in time and logic between subcomponents of the board. It seems as if a general problem with A/D conversion boards is that designers put too many features on one board introducing dependencies and side effects only they are able to deal with correctly. This might be the reason why it is usually not possible to get good documentation—it simply does not exist, because nobody is able to write it.

A new and very interesting trend in measurement devices was recently initiated by Intelligent Instrumentation (a Burr Brown company). Their EDAS (Ethernet Data Acquistion System) is a 16 channel, 12 bit, 100KHz A/D conversion device which can be hooked to the Ethernet. For UNIX they deliver a library in source code to talk to the device, i.e., program it and read the data. No new device driver must be written. The device can either be connected to a local network or, if continous high speed transfer is necessary, it can be connected to its own “network”--a direct line between the device and a dedicated Ethernet board in the computer. However, while this idea is very nice and is similar to those fashionable WebCams, the EDAS is a bit broken for two reaons: A minor annoyance is that it does not understand RARP (reverse address resolution protocol). To set its IP address, it has to be connected to a computer via a serial line. A more major problem is the device's inability to continuously pump the 100KHz it samples onto the Net. After the first enthusiasm we were very disappointed when the German distributor told us that the EDAS' microcontroller can fill the internal 32 kilometer samples of memory at 100KHz, but that it is too slow to stream the data to the Ethernet at the same speed.

Considering the price of 2500 DM (about $1400 US), it would be cheaper to combine a single-board PC (1000 DM) with an A/D conversion board (1000 DM) and, say, some flash RAM as replacement for a disk into a small case. Install a minimal Linux and a suitable daemon as an interface between IP and the device driver of the A/D board, and you have an iDAB (Internet Data Acquisition Box). Depending on the application, you can even install software to preprocess the data before it is passed to the network.

Harald Kirsch is currently employed at IITB where he managed to convert his group from DOS-based to Linux-based systems. In his free time he works on his degree. If he has free time after work and school, Harald likes to swim, cycle and play volleyball and badminton. He can be reached at kir@iitb.fhg.de.