lex and yacc: Tools Worth Knowing

Dean Allen Provins

Issue #51, July 1998

Today, computers can talk and they can listen—but how often do they do what you want?

This article is about how Linux was used to program a Sun machine to manipulate well-log recordings to support finding oil and gas exploration in Western Canada. It will describe the problem, provide enough background to make the problem understandable, and then describe how the two fascinating UNIX utilities lex and yacc were used to let a user describe exactly what he wanted to satisfy his particular need.

Some Background

In the fall of 1993, I had been recently downsized and was approached by a former colleague for assistance. He, and another former colleague, both of whom were still with my last employer, were what is known in the industry as well-log analysts.

To understand what a log analyst is requires a little knowledge of oil and gas exploration methods. Energy companies, as they like to be known, employ several different categories of professionals to assist them in finding salable quantities of hydrocarbons. Chief among these are the geologists and geophysicists (of which I am one) who study the recordings made in bore holes, or process and examine seismic recordings to identify what are popularly known as “plays” or “anomalies”.

Bore holes are simply the holes left when a drill rig moves off the drilling platform. Generally, these holes are logged by service companies who lower instruments called tools into the hole, and who then record on magnetic tape the readings made by those instruments.

There are many different types of tools, including sonic (which measures the time needed for a pulse of sound energy to travel through the rock wall from one end of the tool to the other), density (a continuous recording of the rock wall density), and gamma ray (a measure of gamma radiation intensity in the rock). These are just a few of the types of measurements that are made, recorded and called logs.

The various logs are examined by geologists to gain an understanding of what was happening when the rocks forming the bore hole were laid down, and what has happened to them subsequently as shallower rocks were created above them.

Geophysicists are more inclined to study seismic recordings which in essence are indirect measurements of the properties of the rocks forming the subsurface. Geophysics and Linux will not be discussed here, but you may find Sid Hellman's “Efficient, User Friendly Seismology”, Linux Journal, August 1995, Issue 16 of interest.

While seismic recordings provide much greater volumes of interpretable data over large areas, well logs are definitive measurements made at single locations, sometimes close together, and sometimes not. Geologists often correlate adjacent well logs to create cross sections of the subsurface, much like seismic recordings would provide. Detailed interpretation of individual logs, however, is often left to the log specialists.

The Problem

My two acquaintances were log specialists who wanted an additional tool to assist them in the processing and interpretation of individual or combinations of logs. Specifically, they wanted to tell the computer to perform arithmetic operations on individual or some algebraic combination of logs.

For example, they might need to scale a specific log by an arbitrary amount, say 1.73. In another case, they might want to divide one log by another, and then add the result to a third, all before adding a constant and then raising the resulting values to some arbitrary power.

Keeping in mind that logs are composed of individual sample values taken as frequently as every few inches (or centimeters as they are here in Canada and many other places in the world), these example requests would mean a reasonable amount of computation, multiplying every sample of thousands of meters of recorded log values by 1.73, in the first example. The simple scaling problem isn't particularly difficult, but identifying the desired log could be.

The energy company for which my acquaintances worked was using a simple filing method for all the log curves (a curve corresponds to all the recorded samples for one tool in one bore hole) wherein each curve was identified by a name. To this was added some additional information on units and so on, plus all the samples for all the curves for the well. All the data were stored as ASCII. (The file format is known as Log ASCII Standard format, or LAS version 2.0, and although the names for curves were generally the same from well to well, that was not guaranteed.)

As a result, more complicated combinations of curves required a fairly sophisticated and robust mechanism for arbitrary name recognition, while the desired algebraic operation was being described. Given such an interesting challenge, I recognized an opportunity to put some relatively little-used tools to work: lex and yacc.

The Tools

The program lex is a lexical analyzer. Lexical analysis is the recognition of words in a language. As used in this particular application, lex, or more specifically flex, is used to recognize characters forming the names of log curves, arithmetic operators and algebraic groupings.

flex is a particular example of the lexical analysis programs available for UNIX systems and is the implementation delivered with Linux distributions. The original implementation was done by Mike Lesk and Eric Schmidt at Bell Laboratories and is documented in the book lex & yacc by John R. Levine, Tony Mason & Doug Brown, O'Reilly & Associates, Inc., 1992.

yacc is a language parser. It accepts word items and, given a list of rules describing how these items form larger entities, deduces which items or combinations of items satisfy a rule. This can also be thought of as grammatical analysis. Once a rule is satisfied, a programmer's code is applied to the result.

In the case of English, the words in a sentence can be assigned grammatical types such as noun, verb, adjective and so on. Particular combinations of words form more complex units and these in turn can be described as complete sentences.

For example, the sentence “The lazy dog lay in the sun,” is composed of an article “the”, a preposition “in”, adjective “lazy”, nouns “dog, sun” and a verb “lay”. Combinations of these grammatical items form more complex entities such as noun phrases “The lazy dog” and “in the sun”. The first noun phrase is the subject of the sentence, and the second, in combination with the verb, forms the predicate. Together they form a complete sentence.

It is possible to form parsers for the English language, although given English's many idiosyncrasies, yacc may prove to be inadequate for the task. It may also be that the yacc programmer may become exhausted in trying to describe all the nuances of the language.

yacc was originally developed to provide a mechanism to develop compilers, but it could just as easily be used to create interpreters. For example, BASIC is often an interpreted language and could easily be described by a yacc grammar. Once yacc understood a particular line of BASIC code, it could cause the execution of the equivalent instructions in the native language of the host computer.

Some Linux distributions provide a choice of yacc programs. One can install either (or both) Berkeley yacc or the GNU bison program. You'll probably find them in /usr/bin. They are quite similar; bison was originally derived from yacc, although there has been some divergence over the years.

The combination of lex, yacc and some programmer's C code provides a complete means to interpret and act upon a user's wishes. The lex program uses its regular expression interpretation capability to recognize strings of characters as words or tokens. (The term “words” is used loosely to describe any recognized string of characters.) Once a token is identified, it is passed to yacc where the various rules are applied until some combination of tokens form a recognizable structure to which yacc applies some pre-programmed C code.

How The Tools Are Used

As indicated, lex uses regular expressions to recognize strings of characters as items of interest. Regular expressions are composed of special characters which describe acceptable combinations of characters.

For example, regular expressions often use the character . (period) to indicate that any character except a newline (\n) is acceptable.

Similarly, the characters [ and ] (square brackets) are used to indicate acceptance of any of the characters enclosed within them or within the range of characters described between them. For example, the expression [abc] says to accept any of the characters a, b or c; the expression [a-c] says the same thing. A more complicated example might be [a-zA-Z0-9] which says to accept any alphanumeric character.

For a complete description of lex regular expression syntax, see lex & yacc by Levine, Mason and Brown (O'Reilly, 1992).

Once a regular expression matches the text stream being interpreted by lex, code created by the programmer is executed. When used with yacc, this code generally amounts to passing an indication of what was just recognized to yacc for further processing. The indication is a token that yacc knows about, and in fact, these are defined in the yacc portion of the analyzer/parser program so that they are common to both lex and yacc.

Also as indicated, yacc uses a grammar description to decode the meaning of the tokens that lex passes to it. As tokens are passed to yacc, it applies its rules until a single token, or some sequence of tokens, becomes a recognizable structure.

Before a programmer's C code is executed, though, yacc may require several structures or token-structure combinations to be recognized. For example, using our sample sentence, our rules might look like the following:

sentence  :  subject + predicate
{...execute some C code...}
subject       :  noun
              |  noun_phrase
predicate     :  verb + noun_phrase
noun_phrase   :  preposition + adjective + noun
              |  adjective + noun

The first rule says that a sentence is made up of two parts: a subject followed by a predicate. If that rule is satisfied, then the C code between the curly brackets will be executed. To satisfy the first rule, yacc has to recognize both a subject and a predicate. The subsequent rules help yacc to do just that.

For example, the second rule says that a subject is recognized when either a noun or a noun phrase is recognized. A noun is the smallest unit that yacc deals with, and in the yacc grammar, a noun is a token that yacc will want to have lex recognize. Thus, somewhere in the yacc program, a token will be defined (probably called NOUN) that lex and yacc will use to communicate the fact that a noun has been interpreted. How this is done we will see shortly.

Notice that a noun phrase is also used to create the predicate. If a verb is recognized and it is followed by a noun phrase, the predicate is identified. If only the noun phrase is identified, then the subject has been identified.

The example cited is not in yacc syntax, but is meant to provide understanding. Very detailed examples may be found in the references.

You may be wondering how the yacc parser actually works. yacc works as a finite-state machine, and it has a stack (think of this as a memory of what has been seen, as it tries to deduce what the incoming stream of tokens represents).

A finite-state machine records its current condition as each recognizable item is interpreted. For example, as a noun phrase is being interpreted, it moves from state 3 when it receives a preposition to state 4 when the adjective is interpreted and finally to state 5 when the noun is recognized. When the entire phrase has been recognized, it switches to another state, perhaps 37, to note that fact. Please do not attach any particular meaning to the numbers used in this example. They have been chosen arbitrarily to demonstrate how yacc progresses as it interprets the tokens received from lex. You should conclude only that to reach state 5 (noun phrase), yacc must progress through several preceding states, each of which might lead to another final state, depending on the grammar yacc is using.

In other words, given its current state, yacc requests from lex the next token (if it needs it) and places onto the stack its new state. In doing so, it may push the new state onto the stack (as when interpreting the noun phrase), or pop the old state off the stack, replacing it with a new state (as when the noun phrase is completely recognized). These actions are called “shift” and “reduce” and describe pushing and popping states to and from the stack.

When the sentence is finally recognized, yacc accepts it and returns to the calling program (the main program which invoked yacc and indirectly lex). For a complete description of how a yacc parser works, see Inside Xenix by Christopher Morgan, Howard W. Sams and Co., 1986. This reference describes yacc grammars and how yacc parses its input in exquisite detail.

Basic Coding of lex and yacc Programs

Both tools are coded in a similar manner. There are three sections in each program: declarations, rules and user routines. Each is separated by a line containing only the two characters %%.

For yacc, the declarations section contains the tokens it can use while parsing the input. Each has a unique value greater than 256, and the set of tokens is introduced via %token at the beginning of the line. lex can use the declarations section to define aliases for items that it must recognize while looking for tokens to pass to yacc.

For example, lex needs to know about white space which, while not used in identifying tokens, must be accounted for in some way. Similarly, mathematical symbols such as + or = must be recognized. These are needed in the interpretation of the algebraic statement coded by the user.

Within the rules section, yacc holds its parsing intelligence. This is the section that contains the grammar rules referred to in the English sentence example earlier. In fact, the coding used earlier is typical of a yacc grammar: items to be recognized are separated from the means to recognize them by a colon (:), and alternative means of recognition are separated from each other via a pipe (|) symbol.

lex uses the rules section to contain the regular expressions that allow it to identify tokens to pass to yacc. These expressions may be the aliases from the declaration section, regular expressions, or some combination.

The last section contains C code which may be invoked as each of the tools processes its input.

One requirement is that the yacc tokens be known to the lex program. This is accomplished by including the following statement:

#include "y.tab.h"

in the lex declarations section and creating it when compiling the yacc program code.

Compilation is accomplished in the following way:

yacc -d yacc.file -create 'y.tab.c and y.tab.h'
flex flex.file -create 'lex.yy.c'

The -d option on yacc's command line creates the y.tab.h file needed by lex.yy.c.

How lex and yacc were employed in Log Analysis

To successfully interpret the user's desired process, the program needs to know which well logs were available for processing. This information is available in the ASCII file selected by the user. This text file contains a one-to-one correspondence between curve description and data values. A very small subset of a typical file is shown in Listing 1.

Listing 1

As can be seen, there are several sections including well information (includes some hole parameters), curve information (notes which curves are in the file) and “A” which holds the recorded data values. Each is introduced with a tilde (~). Because the format of the file is fixed by convention, these are easily parsed, and needed values are stored for subsequent processing.

As written for the client, the program is a Motif application. The user selected the file to be processed; it was read in its entirety and numeric text items were converted to double-precision values.

Besides allowing file and curve merging and editing, there is a menu item for curve processing. Upon selecting this menu item, a dialog box is presented containing a list of available curves and arithmetic operations. The user selects curve names, numeric constants and operations which in turn are displayed as an algebraic operation on a text input line. When satisfied with the mathematical operation, the user clicks OK and the lex and yacc magic occurs. The result is stored as a separate curve and can be included in subsequent operations.

lex processed the incoming algebraic statement with the code shown in Listing 2.

Listing 2

Between lines 1 and 16 are declarations to be used in the program generated by lex. In particular, you will notice the inclusion of the header file y.tab.h which contains the following definitions:

#define INTEGER 257
#define FLOAT 258
#define DOUBLE 259
#define NUMBER 260
#define VARIABLE 261
#define EQUAL 262
#define LPAREN 263
#define RPAREN 264
#define PLUS 265
#define MINUS 266
#define TIMES 267
#define DIVIDE 268
#define RAISE 269
#define LHS 270

These definitions are used by both lex and yacc to describe the items yacc expects to receive from lex. They are generated by statements 73 to 77 of the yacc source which will be examined shortly.

From lines 17 to 31 of the lex listing are declarations which amount to aliases for common items that we wish lex to recognize. For example, we declare DIGIT to be any single numeric between 0 and 9 on line 21. Doing this allows us to declare INT (an integer) to be one or more DIGIT's.

Lines 33 to 90 contain the rules by which lex interprets incoming text items. For example, on line 34 we recognize an equal sign (=) and return the token EQUAL to the caller. In y.tab.h, EQUAL is defined to be 262.

As you can see, the lex rules simply recognize text items and then advise the caller what was seen in the input stream.

Listing 3

yacc interprets the token stream passed to it by lex with the following code, only a subset of which is shown in Listing 3. The code for the yacc routine (with the calling subroutine do_arithmetic and its accessory functions) was in excess of 900 lines. For those interested, it is available for your perusal from SSC's public FTP site. Listing 3 is a sample indicating what needed to be done.

Like the lex routine, yacc begins with lines to be included in the output code. Programs written for graphical user interfaces sit in a loop waiting for the user to click on something. When the user's needs are so indicated, the GUI-based program calls a function to perform the required action. These “called functions” are popularly called callbacks. In this program, one of the callbacks was do_arithmetic, which in turn called the yacc routine, which in its turn called the lex routine.

In Listing 3, do_arithmetic is described in the first few lines, and a portion of the code may be seen in lines 428 to 532. They are shown only to give some indication of what was being accomplished.

yacc does the work with its rules section beginning at line 79, and ending at line 426. Although too long to be included completely, you can see that an equation is defined to be something called an lhs (left hand side) EQUAL rhs (right hand side) at line 80. Looking down the listing, you will see that an equation may also be described by an expr (expression). When either of these are encountered, yacc pops a key from an internal stack created by a function called push (see near line 557) and then causes a log curve to be returned to the caller by calling another function called get_curve (not shown here, but included with the yacc and lex code).

Between lines 118 and 139, you can see how some of the tokens yacc expects are processed when recognized. The complete listing has many more.

Results

The lex, yacc and supporting code was successfully employed to allow the log analysts to process various log curves. To have written the C code to accomplish the lexical analysis and parsing logic would have taken much longer than the four weeks allowed. As it turned out, this code was much easier to create and debug than it was to introduce into the final Motif application, even though it was written as a callback.

In fact, the number of lines of lex (152) and yacc (953) code were far fewer than the number of lines generated by the two (2765). Of course, one could take the time to write much tighter code than these general purpose tools deliver.

Nevertheless, should you be faced with a similar problem, I strongly recommend using lex and yacc. They are powerful, reliable tools worth knowing.

All listings referred to in this article are available by anonymous download in the file ftp://ftp.linuxjournal.com/pub/lj/listings/issue51/2227.tgz.