Updating Pages Automatically

Reuven M. Lerner

Issue #53, September 1998

Have a need to change a file on your web site on a daily or monthly basis? This month Mr. Lerner tells us how to do it.

The home page of my web browser is set to http://www.dilbert.com/, home of the famous and funny Dilbert comic strip. Thanks to the magic of the Internet, I'm able to enjoy Dilbert's tragicomic humor each morning, just before I start my workday.

The Dilbert web site would not be very useful or interesting were it not for the creative talents of Scott Adams, Dilbert's creator. What makes it interesting from a technical perspective is the way in which the comic is updated automatically each day. Every morning, the latest comic is automatically placed on the Dilbert home page, giving millions of fans the chance to see the latest installment.

This month, we will examine several ways in which you can create pages that are automatically updated, so that a user can discover new content at the same URL each day. We will look at several different means to the same end, ranging from CGI programs to cron jobs, and will even take a brief look at how to use databases when publishing new content.

Pointing with CGI

For starters, let's assume our web site consists of seven different pages, one for each day of the week (e.g., file-0.html on Sunday, through file-6.html on Saturday). How can we configure the site so that people requesting today.html (or today.pl) will be shown today's file? In other words, a visitor on Wednesday should be shown file-3.html when requesting today.html. Such a system might be appropriate for a school cafeteria, where the menu for a given day of the week tends to be the same from one week to the next.

Perhaps the simplest solution is a CGI program, which we will call today.pl. If we write the program in Perl, we can easily determine the day of the week using the localtime function, which returns a list of elements describing the current date and time. Using the seventh element of that list (index 6, since Perl counts from zero), which indicates the current day of the week, we can create the correct URL for that day. Finally, we can use the HTTP “Location” header to redirect the user's browser to the correct location.

A simple implementation of this program is shown in Listing 1. The program should seem familiar to anyone who has written CGI programs. It enables all of Perl's warning systems: -w for optional warnings, -T for extra security, strict for extra compile-time checking and diagnostics for more complete documentation if something fails.

By using CGI.pm, the standard Perl module for writing CGI programs, we gain easy access to any input passed by the server, as well as the various output methods a CGI program might use. Most CGI programs use the output methods meant for returning HTML to a user's browser, including sending a MIME “Content-type” header indicating the type of content about to be sent—in our case, we return a “Location” header, which removes the need for a “Content-type” header.
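For readers without the listing at hand, here is a minimal sketch of the same idea. It is not Listing 1 itself: the base URL is a placeholder, url_for_wday is a name invented for this example, and the “Location” header is printed by hand rather than through CGI.pm, purely to keep the sketch free of dependencies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical base URL; replace it with the address of your own site.
my $base_url = 'http://www.example.com';

# Build the URL for a given day of the week (0 = Sunday ... 6 = Saturday).
sub url_for_wday {
    my ($wday) = @_;
    return "$base_url/file-$wday.html";
}

# localtime returns (sec, min, hour, mday, mon, year, wday, yday, isdst),
# so the day of the week is the element at index 6.
my $wday = (localtime)[6];

# A "Location" header followed by a blank line is a complete CGI response;
# no "Content-type" header is needed for a redirect.
print "Location: ", url_for_wday($wday), "\n\n";
```

With CGI.pm available, the final print could instead be handed to that module's redirect method, which produces the same header.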

If the above program is installed as /cgi-bin/today.pl on our server, visitors will always be greeted with the appropriate file for the current day of the week.

The above program, simple as it is, has several flaws. Most significantly, CGI is slow and inefficient; using it to redirect the user's browser to another file will slow down the user's experience, as well as increase the load on your server. Each time a CGI program is invoked, the server must create a new process. If the program is written in Perl, this means the Perl binary must be started, which can take some time.

One solution might be to use mod_perl, which inserts a fully working version of Perl into the Apache web server. Using mod_perl means Apache no longer needs to create a new process, execute the Perl binary or compile the Perl program, which will cut down on server resource use. However, this still means that each time a user requests the home page, the server must execute a program. If the page is requested 1,000 times in a given day, then the program will run 1,000 times. This might not sound like much, but imagine what happens when your site grows in popularity, getting 1,000,000 hits each day.

Even this solution doesn't address the fact that not all users run browsers which handle redirection. If a browser does not handle the notice, the user will be unable to see today's file. This problem is increasingly rare, but keep it in mind if you want the maximum possible audience for your web site.

Automatically Copying Pages with cron

Let's now examine a strategy in which the program runs only once per day, regardless of how many people ask to see today's page. This method reduces the load on the server and allows people with old browsers to visit our site without any trouble. The easiest strategy is to use Linux's cron utility, which allows us to automatically run programs at any time. Using cron, we can run our program once per day, copying the appropriate file to today.html. On Sundays, file-0.html will be copied to today.html, while on Thursdays, file-4.html will be copied to today.html.

Listing 2 is an example of such a program. If this program were run once a day, then today.html would always contain the file for the appropriate day. Moreover, the server would be able to respond to the document request without having to create a new CGI process or use Perl.
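A rough sketch of such a copying program follows. It is an illustration rather than Listing 2 itself: the subroutine name copy_todays_file is made up for this example, and the sketch assumes the directory holding the files is passed as the first command-line argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);

# Copy file-N.html (N = day of the week) to today.html in $directory.
# Returns the name of the file that was copied.
sub copy_todays_file {
    my ($directory, $wday) = @_;
    my $source = "$directory/file-$wday.html";
    my $target = "$directory/today.html";
    copy($source, $target)
        or die "Cannot copy $source to $target: $!\n";
    return $source;
}

# Run only when a directory is named on the command line, for example:
#   cron-today.pl /path/to/document/root
if (@ARGV) {
    my $copied = copy_todays_file($ARGV[0], (localtime)[6]);
    print "Copied $copied to today.html\n";
}
```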

The above program is not run through CGI, but rather through cron. In order to run a program through cron, you must add an entry to your crontab, a specially formatted text file that describes when a program should be run. Each user has a separate crontab file; that is, each user can arrange for different cron jobs to run at different dates and times.

You can edit the crontab file using the crontab program, which is typically in /usr/bin/crontab. To modify your crontab file, use crontab -e, which brings up the editor defined in the EDITOR environment variable. The format of crontab is too involved for me to explain here; typing man 5 crontab on the Linux command line will bring up the manual page describing the format. (Typing only man crontab will bring up a description of the crontab program, rather than the crontab file format, a distinction which can be confusing to new users.)

Assuming we want to run the above program (which I have called cron-today.pl) at one minute after midnight, we could add the following entry to our crontab:

1 0 * * * /usr/local/bin/cron-today.pl

In other words, we want to run /usr/local/bin/cron-today.pl at one minute after midnight (1 0), every day of the month (*), every month (*), and every day of the week (*).

The output from each cron job is e-mailed to the user who owns that job. After installing the above line in my crontab, I receive e-mail from the cron job each day at approximately 12:01 a.m., and each day anyone visiting our site is shown the correct file for today.html.

Using Symbolic Links

The above cron-based technique works, but has some annoying side effects. For example, what happens if you decide to change the Tuesday menu on Tuesday morning? The change will not be reflected until the following Tuesday, because today.html contains the contents of file-2.html from 12:01 a.m., when the snapshot was taken.

In order to solve this problem, as well as reduce the disk space used by two copies of the file, we can use symbolic links. These look like files, but are really pointers to files, similar to Macintosh “aliases” or Windows “shortcuts”. If we create a symbolic link from today.html to file-0.html, the two file names will be equivalent for most purposes. (Other “hard” links are also available under Linux, but are more limited.)

If we want to create a symbolic link named today.html that points to file-0.html, we say

ln -s file-0.html today.html

If you want to change the link so that it points to file-1.html, remove the old link and create a new one, like this:

rm -fv today.html
ln -s file-1.html today.html

Alternatively, we can use the -f (“force”) option to ln, forcing the link assignment even if it was previously linked elsewhere:

ln -sf file-0.html today.html

If we were to do this each day, removing the old link and creating a new one, we would be doing effectively the same thing as in cron-today.pl, but with the added advantage of equating the two files. In addition, we would be saving space on the file system by pointing to the original file rather than copying it.

Listing 3 contains a short Perl program meant to be run via cron, which creates such a link. Anything sent to standard output (STDOUT) via “print” statements is sent to the owner of the cron job. This program assumes the owner of the cron job (under whose user ID the program is run) has permission to remove the existing file, as well as create a new symbolic link in the directory. It is possible to create a symbolic link to any file, including a nonexistent file; only when you try to access the file are the permissions checked.
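The following sketch shows one way such a link-making program might look. It is not Listing 3 itself; link_today is a name invented here, and the directory is assumed to be given on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Point today.html at file-N.html (N = day of the week) inside $directory.
# Returns the target that the new link points to.
sub link_today {
    my ($directory, $wday) = @_;
    my $link = "$directory/today.html";

    # Remove whatever is there now; -l is true even for a dangling link.
    if (-l $link || -e $link) {
        unlink($link) or die "Cannot remove $link: $!\n";
    }

    # The target is relative, so it is resolved within $directory.
    symlink("file-$wday.html", $link)
        or die "Cannot create link $link: $!\n";

    # Under cron, this output is mailed to the owner of the job.
    print "today.html now points to file-$wday.html\n";
    return readlink($link);
}

# Run only when a directory is named on the command line.
if (@ARGV) {
    link_today($ARGV[0], (localtime)[6]);
}
```

Note that symlink succeeds even if file-N.html does not exist, which matches the behavior described above.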

Publishing Daily Items

The techniques we have examined so far are most useful when the same item appears each week or perhaps each month. In many cases, though, publishing on the Web involves creating a new file each day and making that available. For starters, we will look into how to create a new file each day (of the form file-1.html, as before), so that the newest file will be available by looking at today.html.

Once again, we could accomplish this with either a CGI program or a cron job, examples of which you can see in Listing 4 and Listing 5, respectively. Both programs use the same basic algorithm to find the highest-numbered file of the form file-n.html, where n is the sequential number for the file.

The key to both programs is in these lines:

if (opendir(DIR, $directory))
{
    @files = sort by_number
        grep {/^file-[0-9]+\.html$/} readdir(DIR);
    closedir DIR;
}

First, we open $directory, the directory in which the files exist. (If the program cannot open the directory, it logs an error.) We then read the contents of the directory DIR, using Perl's grep function to filter out any files not fitting the file-n.html pattern. Finally, we sort those files with our own by_number routine, which compares the sequential numbers rather than the full file name.

Once we have the list of files, we pick off the last element of @files, which has the highest sequential number. We can then redirect the user's browser to that file using CGI.pm's redirect method.
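The listings define by_number themselves. As an illustration only, one plausible way to write it, together with a subroutine that returns the highest-numbered file, is sketched below. The helper names file_number and newest_file are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the sequential number from a name such as file-12.html.
sub file_number {
    my ($name) = @_;
    my ($number) = $name =~ /^file-([0-9]+)\.html$/;
    return $number;
}

# Sort routine: compare numerically, so file-10.html follows file-9.html
# instead of landing between file-1.html and file-2.html.
sub by_number { file_number($a) <=> file_number($b) }

# Return the highest-numbered file-n.html in $directory, or undef if
# no file matches the pattern.
sub newest_file {
    my ($directory) = @_;
    opendir(my $dh, $directory)
        or die "Cannot open $directory: $!\n";
    my @files = sort by_number
                grep { /^file-[0-9]+\.html$/ } readdir($dh);
    closedir($dh);
    return $files[-1];
}
```

A plain alphabetical sort would place file-10.html before file-9.html, which is why the comparison is made on the extracted numbers.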

If we want to publish items each day, we should try a better system than this one, which depends on sequential numbers. First of all, it is easier to handle file names which mention the subject (e.g., menu.html) or the date (e.g., file-1998-06-01), rather than something named with sequential numbers, as in file-3023.html.

Secondly, arranging articles by date provides users with a natural way of navigating through archives in the future without having to depend on the site's navigation scheme. In addition, creating file names according to date rather than sequential numbers decreases the chances of error.

If you choose to use the date in the file name, as in file-1998-06-01, try to keep the date elements in year-month-day order, so that sorting file names alphanumerically will also sort them chronologically. Then, we can write a small program to select the file for today based on the date and run it each day with cron. An example is shown in Listing 6. The program logic is fairly straightforward, taking the date information from our call to localtime and piecing those elements together to create the file name.
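Piecing the file name together might look something like the sketch below. It is not Listing 6 itself: filename_for_time is a name invented for this example, and the .html suffix is an assumption about how the files are named.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a name such as file-1998-06-01.html from an epoch time.
sub filename_for_time {
    my ($time) = @_;
    my ($mday, $mon, $year) = (localtime($time))[3, 4, 5];
    # localtime counts months from 0 and years from 1900, so adjust both.
    # Zero-padding keeps alphanumeric order equal to chronological order.
    return sprintf('file-%04d-%02d-%02d.html',
                   $year + 1900, $mon + 1, $mday);
}

# Show the name that corresponds to today's date.
print filename_for_time(time), "\n";
```

The name returned here could then be used as the target of the today.html symbolic link.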

However, problems may arise if the file for today does not exist. As I mentioned earlier, symbolic links do not have to point to existing files; they may point to any valid file name, even if no file by that name exists. However, if the symbolic link points to a non-existent file, users will be greeted with a dreaded “404 File not found” error upon loading today.html from our site. A more sophisticated version of this program would check to see if a file corresponding to today's date existed on the site. Such a program would then search backward (or forward, if you prefer) chronologically to find the best match for the today.html symbolic link. It could even send e-mail to the webmaster indicating that such a problem existed.
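The backward search can be sketched as follows, assuming the files are named file-YYYY-MM-DD.html (the .html suffix and the subroutine name best_match are assumptions made for this example). Because the dates are in year-month-day order, a simple string comparison is enough to find the most recent file that is not later than today's.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the latest dated file in $directory whose name is not later
# than $today (for example, 'file-1998-06-05.html'), or undef if no
# such file exists.
sub best_match {
    my ($directory, $today) = @_;
    opendir(my $dh, $directory)
        or die "Cannot open $directory: $!\n";
    # Keep only dated files on or before today, in chronological order.
    my @candidates = sort { $a cmp $b }
                     grep { /^file-\d{4}-\d{2}-\d{2}\.html$/ && $_ le $today }
                     readdir($dh);
    closedir($dh);
    return $candidates[-1];
}
```

If best_match returns undef, there is nothing suitable to link to, which would be a sensible moment to notify the webmaster.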

Using Databases

One additional method for publishing material on the Internet regularly is using databases. Rather than relying on file names keyed with particular dates, we can create a table that establishes a correspondence between file names and dates. We can then write a CGI program to retrieve the current file or a program meant to be run via cron to create a symbolic link to the current file.

Another option is to store the files inside of the database. However, if we were to do that, we would also have to make it possible for the site's editors and designers to store, retrieve and edit the information inside the database. For our purposes, we will assume the files exist on the server's file system, and we are trying to point to them rather than store their contents in a different way. These examples were tested under Red Hat 5.1, Perl 5.004_04, the database interface (DBI) libraries for Perl, and MySQL, a mostly free relational database system available from http://www.mysql.com/.

Before we can do anything else, we have to create a table to hold the information. The table will be relatively simple, containing only file names and dates. We will assume that each article can be published on only one date, but that each date can contain multiple articles, which makes our table creation command look like the following:

CREATE TABLE Articles
 (filename  VARCHAR(100) NOT NULL PRIMARY KEY,
 date    DATE NOT NULL);

In the above, we define filename as a text field of up to 100 characters, which must be filled in (NOT NULL) and cannot be the same as any other file name (PRIMARY KEY). If we try to insert the same file name on two different dates, the database will stop us. By contrast, because we want to allow more than one file on a given date, the date field (which has a type of DATE) is defined only as NOT NULL and not as a key: we must indicate a date with each file name, but several file names may share the same date.

In order to add a file to our database, we can use the following SQL command:

INSERT INTO Articles (filename, date)
VALUES ("foobar.html", "1998-06-05");

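If you are using MySQL, you must put quotation marks around the date. Without them, 1998-06-05 is evaluated as an arithmetic expression (1998 minus 6 minus 5) rather than as a date, and the default date of 0000-00-00 will be inserted.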

In addition to the confirmation message (1 row affected) we receive upon submitting the above query, we can check the contents of the table:

mysql> SELECT * FROM Articles;
+-------------+------------+
| filename    | date       |
+-------------+------------+
| foobar.html | 1998-06-05 |
+-------------+------------+
1 row in set (0.08 sec)

Entering information into a database using raw SQL is inefficient, prone to errors and unhelpful for users who are unfamiliar or uncomfortable with SQL. Listing 7 contains an HTML form that can be used to enter new articles into the database, using the program in Listing 8.

Finally, we will need a version of today.pl that retrieves the file for today. A CGI version of the program is in Listing 9; rewriting it such that it uses cron should be fairly straightforward. A more sophisticated version of the program would even check to see if the named file exists, searching backward through earlier dates if it does not.
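The database lookup at the heart of such a program might be sketched as below. This is not Listing 9: filename_for_date is a name invented for this example, the connection details in the comment are placeholders, and the sketch assumes an already connected DBI database handle. When several articles share a date, it simply returns the first file name in alphabetical order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one file name published on $date (in YYYY-MM-DD form), or
# undef if the Articles table has no entry for that date.
# $dbh is a connected DBI database handle.
sub filename_for_date {
    my ($dbh, $date) = @_;
    # The ? placeholder lets DBI quote the date for us.
    my ($filename) = $dbh->selectrow_array(
        'SELECT filename FROM Articles WHERE date = ? ORDER BY filename',
        undef, $date);
    return $filename;
}

# Typical use (requires the DBI module and a MySQL driver; the database
# name, user and password below are placeholders):
#
#   use DBI;
#   my $dbh = DBI->connect('DBI:mysql:database=test', 'user', 'password',
#                          { RaiseError => 1 });
#   my $file = filename_for_date($dbh, '1998-06-05');
```

The file name returned could be passed either to a redirect in a CGI program or to a link-making cron job, depending on which of the earlier approaches you prefer.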

Publishing regular articles on the Web is far less complicated than publishing a daily or weekly newspaper, but still involves a bit of planning and programming. In addition, no matter what method you choose, you will still have to make some trade-offs between performance and flexibility. Nevertheless, creating a page that changes each day and provides access to the site's archives is not especially difficult and can provide enough variety to draw people.

All listings referred to in this article are available by anonymous download in the file ftp://ftp.linuxjournal.com/pub/lj/listings/issue53/3060.tgz.

Reuven M. Lerner (reuven@netvision.net.il) is an Internet and Web consultant living in Haifa, Israel, who has been using the Web since early 1993. In his spare time, he cooks, reads and volunteers with educational projects in his community.