Creating and Using a Database with Perl

Randy Scott

Issue #35, March 1997

Perl 5 includes packages enabling your Perl scripts to create and modify databases in standard Unix formats. One of these databases can be a more efficient alternative than a flat text file (which Perl handles marvelously), and it will be compatible with other languages, like C.

Perl programmers, like programmers of any other language, typically need to store large amounts of data. For this data to be manageable, it needs to be stored in a conveniently accessible format. It never hurts to make the stored data easy to write, as well.

Even though Perl is an exceptional language for text processing, in many circumstances, a more structured database-like format offers quicker access. In addition, it may also be necessary for a Perl script to read or write a database that is also accessed through a C program.

To accomplish this, the Perl distribution includes packages that give a Perl programmer access to a variety of different database formats available in a Unix environment. These formats include: the Berkeley DB format, the Free Software Foundation's GDBM format and the NDBM format.

Associative Arrays

The associative array (or “hash”) is one of the most powerful data structures available to a Perl programmer. To those familiar with traditional arrays (in C, Pascal or Perl), an associative array can be thought of as an array indexed using an arbitrary string instead of an integer subscript. Basically, an associative array is a data structure that allows the programmer to associate one string—a key—with another—its value.

Here is an example of an associative array that can be used to convert the abbreviated name of a day of the week to its full name.

%days = (
        "Sun", "Sunday",
        "Mon", "Monday",
        "Tue", "Tuesday",
        "Wed", "Wednesday",
        "Thu", "Thursday",
        "Fri", "Friday",
        "Sat", "Saturday"
);

The % in front of the variable name days is used to tell Perl the variable is an associative array. As shown, associative arrays are initialized by using pairs of values that relate to each other.

To access the data stored in an associative array, you can use a syntax similar to the following:

$long_name = $days{"Sun"};

This expression will set the scalar variable long_name to the value associated with the key “Sun”, the string “Sunday” in this example.

You can see already that associative arrays can be a powerful tool for organizing data used inside of a Perl script. This technique can easily be extended to something more useful by creating values made up of more than one field. Take, for instance, this simple address book database where multiple fields in the associative array's value are separated by colons:

$phone_db = (
        "Bill Jones", "123 West Avenue:New York, NY:12345",
        "Jane Smith", "6789 1st Street:Chicago, IL:56789"
);

New entries in this database can be added with an expression like:

$phone_db{"Bill Smith"} = join(":", $street, $city, $zip_code);
Data can be extracted from this simple database with an expression like:
($street, $city, $zip_code) = split(/:/, $phone_db{"Bill Smith"});
As you can see, these arrays come in very handy for manipulating data inside a Perl script. However, how can we export this data easily to a file so our scripts or other programs can access the data? One simple method would be to use a text file with the fields of our database separated by colons. This method would make writing out the database from our Perl script very simple. It could be done using a piece of code like the following:
while (($name, $record) = each %phone_db) {
        print "$name:$record\n";
}
This method does not lend itself well to performing a search through the file, as we would need to read, on average, half the lines in the file in order to find the record we are seeking. In addition, writing code to search such a file in other languages (C, for instance) may not be as simple as the Perl script we have written.

To solve this problem, Perl supports “binding” associative arrays to the various types of database formats mentioned above. This allows a Perl programmer to create, access and update databases in the popular Unix database formats as easily as performing operations on an associative array.

Database Support in Perl

Perl version 5 includes a set of “packages” that manipulate the various database formats. These packages are:

  • DB_File—for Berkeley DB databases

  • GDBM_File—for the Free Software Foundation's GDBM databases

  • NDBM_File

  • ODBM_File

  • SDBM_File

To use any of these database packages, a Perl programmer must include the package at the beginning of the script using the following statement:

use DB_File;
In addition, the Fcntl package also needs to be included. This is accomplished by including the following at the beginning of the script:
use Fcntl;
Man pages are included in the Perl distribution for each of these packages. For simplicity's sake, only the DB_File package and its associated Berkeley DB databases are discussed in this article.

Opening a Database

Databases are opened in Perl using the tie() function. This function is responsible for “joining” an associative array with a database package. Operations performed on the associative array are then translated by the database package into function calls that operate on the database file itself.

Here is an example of opening a database named “phone.db” using the DB_File database package:

tie (%phone_db, DB_File, "phone.db") ||
        die ("Cannot open phone.db");

This command binds the associative array named phone_db to the Berkeley DB database file named “phone.db”. In this example, the file must exist and must be readable by the Perl script.

Creating a Database

Creating a database is nearly as simple as opening one. The following command will create a database named “phone.db” in the current directory with the file's permissions set to read-write for the owner and read-only for everyone else. The file will be created only if it does not already exist. If the database file exists in the current directory, the database file will simply be opened for read-write access by the Perl script.

tie (%phone_db, DB_File, "phone.db", O_CREAT|O_RDWR, 0644) ||
        die ("Cannot create or open phone.db");

The O_CREAT and O_RDWR flags are the same flags used as parameters to the Unix open() system call. They specify that the file should be created if it does not exist and opened with read-write access.

Reading from the Database

Reading from the database works exactly like reading data from an associative array. If the key is known, specific records can be read from the file with an expression like:

$record = $phone_db{"Bill Smith"};

All the records in the database file can be scanned (in a seemingly random order) with something like:

while (($name, $record) = each %phone_db) {
        [ commands to process data here ]
}
During each pass through the while loop, the $name scalar variable will be set to the key value from the database, and the $record variable will be set to the data associated with the key.

Writing to the Database

New data can be written into the database by creating a new key in the associative array and setting the key's value. This is done with a command similar to:

$phone_db{"Bill Smith"} = $data;

where $data is the information to be associated with the key “Bill Smith”. Any changes made to the associative array will be written into its corresponding database file.

Deleting Items from the Database

Keys can be removed from the database in exactly the same way items are removed from an associative array in Perl—by using the delete() function. The following code removes the record in the database that refers to “Bill Smith”.

delete $phone_db{"Bill Smith"};

Closing the Database File

Changes to an associative array may not be immediately written out to the database file. To insure that changes are successfully written to the database file, the file must be closed.

Closing the database file involves un-binding the associative array from the database package. This is done with the untie() function in the following manner:

untie(%phone_db);

This closes the database file, making updates to the file if necessary. The associative array %phone_db can now no longer be used to access the records in the database.

Other Types of Databases

All of the examples provided here use the default type of Berkeley DB database, the DB_HASH type. This form of database uses a hash table (like Perl does) to store the keys and their values in the database file. Two other types of databases are provided with the Berkeley DB package: DB_BTREE and DB_RECNO.

The DB_BTREE format uses a sorted, balanced binary tree to store the key and value pairs. This format allows data to be stored and read in a sorted order as opposed to the seemingly random order the DB_HASH format produces. The default comparison routine sorts the keys in the database file in lexical order (alphabetically). The DB_File man page discusses this format in more detail and shows how to replace the default comparison routine with one of your own.

The DB_RECNO format is designed to operate on flat text files. It is bound (with tie()) to normal Perl arrays, not associative arrays. Indexing this array with a number provides the text found on that line of the database file. This format is also discussed in more detail in the DB_File man page.

The desired format of database file is specified with an additional parameter for the tie() function.

tie (%phone_db, DB_File, "phone.db", O_RDONLY, 0644, $DB_BTREE) ||
                die ("Cannot open phone.db");

This command will open the DB_BTREE database named “phone.db” in read-only access mode. If the file does not exist, the command fails.

Other Fun Stuff with Associative Arrays

Sometimes it is necessary to sort an associative array within a Perl script. Sorting by the key values of an associative array is done like this:

for (sort keys %phone_db) {
        print "$_ = $phone_db{$_}\n";
}

Each iteration of this loop will set the $_ scalar to a key value from the associative array provided in alphabetical order. This method works very nicely for sorting associative arrays by their keys. Sorting by an associative array's values is slightly more difficult:

sub sort_by_value {
        ( $phone_db{$a} cmp $phone_db{$b} ) || \
( $a cmp $b );
}
for (sort sort_by_value keys %phone_db) {
        print "$_ = $phone_db{$_}\n";
}
This piece of code substitutes the default routine that sort() uses to order the elements it is given with a special routine. This routine, sort_by_value, sorts the associative array first by the values, and secondly by the keys (i.e., when the two values are identical, compare their respective keys to determine which should appear first).

Keep in mind that these two methods for sorting an associative array do not actually rearrange the array in any fashion. They simply provide a way to pull every key and value pair from an associative array in a particular sorted order.

Putting It All Together

An example of how databases in Perl can be used is provided in Listing 1, a short script designed to keep a database of hits on a World Wide Web site. The script reads the NCSA HTTPD access log file, stores the information in the database and creates an HTML page that displays all the statistics for the site.

Listing 1. Example Web Site Hits Database Script

This implementation is not complete—it keeps track only of which documents were accessed and their sizes. A more complete implementation could also store information about the hosts that accessed the web server, for instance. Some method for “expiring” entries in the database after a particular time interval would be a handy feature as well.

The script begins by reading the existing database file and placing all the data into associative arrays indexed by the document file name. Next, the script reads the access log file from standard input and places the data into the associative arrays that store the statistics. Finally, the script creates an HTML page using tables to display the statistics.

Conclusion

The topics provided in this article are by no means a definitive reference guide for using the built-in database support included with Perl, but they can be used as a starting point for further experimentation and exploration.

Randy Scott is a senior Computer Engineering student at the Milwaukee School of Engineering. He been programming with Unix and C for nearly three years and has become an avid Perl fan in the last six months. Any questions or comments regarding this article can be sent to scottr@bork.com.