Reading E-mail Via the Web

Reuven M. Lerner

Issue #61, May 1999

How to write your own program to read and send mail to any server on the Internet.

E-mail is one of the unsung heroes of the Internet. The Web makes the Internet fun and interesting and allows me to keep up with most newspapers and magazines from the comfort of my Haifa apartment. E-mail allows me to keep in touch with friends, family and clients, as well as receive electronic newsletters in a convenient format.

I usually travel with my trusty Linux laptop, which means that with the help of a telephone line, I can dial in to my Internet provider and download the latest mail. However, on some occasions I cannot dial in to check my mail, even though I have full Internet access and a web browser. I could get an account at Hotmail, but Hotmail allows you to read mail sent to its server only, not to any mail server on the Internet.

This month, I will show you how to develop a set of CGI programs to read e-mail from any POP server. These programs do not provide a full-fledged e-mail client, but they do fill a niche and are useful in certain circumstances. The software described this month should demonstrate how relatively simple it is to create such applications and will have the added bonus of providing basic functionality for the times when you are away from the office.

What is POP?

Traditionally, e-mail on UNIX systems is stored on the user's computer. If you have an account on a UNIX system, e-mail sent to you is placed in a file on your computer. I receive mail on my Linux system in the file /var/spool/mail/reuven.

However, this system became inadequate over time for a variety of reasons. As users began to have their own full-fledged UNIX workstations rather than terminals connected to a central computer, system administrators wanted to centralize incoming mail on a single server.

The answer was POP, “post office protocol”. Rather than retrieving mail from a file on their own system, users would download it from the POP server, with a single POP server per work group cluster. A POP server typically stores incoming mail in a traditional UNIX-style file, but allows retrieval and deletion of individual messages via the network. Just as some cities and towns require their residents go to a central post office in order to retrieve letters and packages, POP requires users to retrieve their mail from a central server.

POP has gone through a number of updates over the years, with the most recent update named POP3. Over time, additional functionality has been added, but the basic commands have remained the same. POP allows users to check if they have mail, retrieve one or more messages and delete one or more messages.

Users are generally shielded from the underlying mechanics of POP3. Most modern e-mail programs support POP3. Indeed, e-mail programs on non-UNIX systems depend on the existence of POP3 servers, since they are rarely able to run mail servers known as “mail transport agents” or “MTAs”. Sendmail and qmail are two examples of MTAs.

Net::POP3

Before writing a CGI program to read our mail, we must understand how the program can accomplish this feat. We could write our own software to talk to a POP3 server, but as is often the case with Perl, a module already exists to handle this for us. In this particular case, the module is Net::POP3, part of the “libnet” package of network modules available on CPAN. (For more information on CPAN and its mirrors, go to http://www.cpan.org/.)

Net::POP3 provides an object-oriented interface to POP, making it possible to connect to a POP server with only a basic understanding of how the protocol works. Import the module with

use Net::POP3;

then create a new object with

my $pop = new Net::POP3($mailserver);
where $mailserver is a scalar containing the name of our POP3 server. If the connection is successful, $pop will be an object with methods allowing us to read and delete messages on the mail server. If the connection is unsuccessful, $pop will be undefined. Now all methods in Net::POP3 work this way, returning undef if the call was unsuccessful. The following code checks for this condition:
die "Error connecting to $mailserver."
   unless (defined $pop);
In order to ensure e-mail remains private, POP3 servers require users to log in with a user name and password. The login method accomplishes that, returning the number of messages waiting for the user:
my $num_messages = $pop->login($username,
   $password);
die "Error logging in." unless (defined
   $num_messages);
Again, notice the test to see whether $num_messages is defined. If it is undefined, then a mistake probably occurred in either the user name or password.

Each message on the POP server is identified with an index number, ranging from 1 to $num_messages. The index number should stay constant during a single POP session, but will change during future sessions. You can use the index number to read or delete a message:

my $message_ref = $pop->read($index);

If message number $index exists, the message headers and body are put into an array reference. Thus, if $index points to a message on our POP server, $message_ref is an array reference. Each element of the array contains a single line of text from the message. We can print the contents of the message by dereferencing $message_ref:

print @$message_ref, "\n";

print-mail.pl

Now that we have seen how Net::POP3 allows us to retrieve and read mail from a POP server, let's look at how we can integrate it into a CGI program. First, an HTML form is needed as a way to enter a user name and password. Here is a simple one:

<HTML>
<Head>
<Title>Read your mail!</Title>
</Head>
<Body>
<H1>Read your mail!</H1>
<P>Enter your user name, password, and POP server.</P>
<Form method="POST"
action="/cgi-bin/print-mail.pl">
<P>POP server: <input type="text" name="mailserver"></P>
<P>Username: <input type="text" name="username"></P>
<P>Password: <input type="password" name="password"></P>
<P><input type="submit" value="Show me my mail!"></P>
</Form>
</Body>
</HTML>

The above form sends three parameters to our CGI program—the name of the POP server from which to download the mail, the user name and the password. If you are concerned about the password being sent in the clear, you might want to put the form and CGI program behind a server running SSL, the secure sockets layer. You might also want to investigate POP3's APOP login method, which hides the password somewhat.

The program for reading mail is fairly simple; see Listing 1 in the archive file, ftp://ftp.linuxjournal.com/pub/lj/listings/issue61/3359.tgz. The code starts by creating an instance of CGI, providing an object-oriented interface to the CGI protocol. Then an appropriate MIME header is sent to the user's browser, indicating the response will be in HTML-formatted text. Next, the three pieces of information necessary for retrieving the user's mail are grabbed: the name of the POP server, the user name and the password.

Once that information is retrieved, we try to connect to the POP server and log in. Normally, invoking die is a bad idea in a CGI program, since it results in a difficult-to-understand message appearing on the user's screen. However, since we ported CGI::Carp and specified fatalsToBrowser, any invocations of die will send a description of the error message to the browser as well as to the web server's error log. This can be an invaluable tool when debugging, even if your final production code requires you to hide potential error messages.

Once the number of messages waiting on the POP server is known, we can retrieve them with a simple loop:

foreach my $index (1 .. $num_messages)
{
   print "<H2>Message $index</H2>\n";
   my $message_ref = $pop->get($index);
   print "<pre>\n", @$message_ref, "</pre><HR>\n";
}

We enclose the mail within <pre> and </pre> tags, since most e-mail depends on fixed-width fonts and formatting.

You may be surprised such a simple program can be used to read your mail, but it does and should work on any system with any web browser. It can be used to quickly check if any new mail has arrived, without affecting your ability to download and read messages with your usual e-mail program.

Ignoring Uninteresting Headers

As is often the case with new programs, our first stab was functional but is missing some useful features. For instance, most users do not need to see all of the headers that come with a message. Typically, they want to see only the “From”, “To”, “Subject”, “Cc” and “Date” headers.

Perl makes it a snap to remove unwanted headers by using regular expressions. Headers can be thought of as a name, value pair separated by a colon. On the left side of the colon is the header name, which can consist of any alphanumeric character or a hyphen. On the right side of the colon is the header's value, which can consist of almost any character.

One consideration is the possibility that a header will be spread across multiple lines. That is, the two lines

Subject: This is a subject header
   that continues onto a second line

should all be considered part of the “Subject” header, since the second line begins with one or more white-space characters.

This problem is solved by creating a hash, %KEEP, in which the keys name the headers to keep. For example:

my %KEEP = ("To" => 1,
   "From" => 1,
   "Subject" => 1,
   "Date" => 1);

The code then checks if a header is to be kept by checking the value of $KEEP{$header_name}, where $header_name contains the value of the header to check.

Before anything can be done to the headers, they must be put into a scalar separate from the message body. Do that with split:

my ($headers, $body) = split "\n \n", $contents, 2;

Notice split has three arguments, telling Perl to split $contents into a maximum of two elements. If the 2 were omitted, $body would contain only the first paragraph of the message, rather than the entire text.

Once the message headers are stored in $headers, it can be split back into an array, and the code can then iterate through the array elements. Each element of @headers is a single header line, which might mark the beginning of a new header or the continuation of an existing one. If this is a new header and its name is in %KEEP, the header is written to the user's browser. If the header's name is not in %KEEP, it is ignored and the program goes on to the next line.

This does not solve the issue of multi-line headers. This is handled by assuming that every line in @headers will begin with either a header (e.g., Received: or X-Mailer:) or with white space. If the pattern at the beginning of the line matches a header value, the program checks %KEEP and if found, prints the line. If the pattern fails to match a header value, it is assumed to be white space, and the line is printed only if the previous line was printed.

Here is some basic code to print the headers:

my @headers = split "\n", $headers;
my $previous = "";
foreach my $line (@headers)
{
   if ($line =~ m/^([\w-]+):/i)
   {
      $previous = $1;
   }
   print $line, "\n" if $KEEP{$previous};
}

This code is contained in Listing 2, better-print-mail.pl, in the archive file. This is an improved version of our original bare-bones program, incorporating this and other changes.

Handling HTML

Displaying e-mail messages in a web browser has advantages and disadvantages. On the one hand, we must be careful to turn special characters, such as < and >, into their literal equivalents. At the same time, we can take advantage of the web browser to make e-mail addresses and URLs clickable.

Since we want to ensure that characters appear in the headers as well as in the message body, we modify $contents, the variable that contains the entire message contents, before separating the header and body. We turn < and > into < and >, respectively, ensuring that literal text will not be interpreted as if enclosed in HTML tags:

$contents =~ s/</</g;
$contents =~ s/>/>/g;

Making e-mail addresses clickable requires the use of a regular expression to match e-mail addresses. I decided to use the following code:

$contents =~
   s|([\w-.]+@[\w-.]+\.[a-z]{2,3})|
   <a href="mailto:$1">$1</a>|gi;
which looks for any combination of alphanumeric characters, hyphens and periods, followed by an @, followed by the same combination of characters, followed by a two- or three-letter top-level domain. This ensures we will not accidentally turn something like
three pickles @ 20 cents/pickle
into an e-mail address. By turning an actual e-mail address into a “mailto” link, users can click on the link in order to send mail to that address.

Making URLs clickable is somewhat more difficult, since we have to handle more combinations. The code below appears to match a large number of URLs:

   s|(\w+tps?://[^\s&\"\']+[\w/])|
   <a href="$1">$1</a>|gi;

Here, we look for any letters ending with “tp”, with an optional “s” on the end. This allows us to match “ftp”, “http” and “https”, all of which are valid protocols. We then allow any combination of characters following the two slashes, excluding white space and several characters which cannot be transmitted in a URL.

Quotation marks and white space can be sent if they are URL-encoded first. Characters are URL-encoded when the hexadecimal value of their ASCII code is preceded by a percent sign. For instance, the space character is ASCII 32 or 0x20; thus, it can be sent in a URL as %20. CGI.pm automatically decodes such characters, so you need not worry about it in most cases.

The final part of our regular expression stipulates that the final character of a URL must be alphanumeric or a slash. This ensures that odd trailing characters, such as periods and commas, will not be accidentally dragged into the URL and highlighted.

Viewing Selected Messages

The above program works just fine, if you want to view all the messages in your mailbox. If you receive many e-mail messages, viewing all of them in a single long web document can get frustrating.

The program better-print-mail.pl takes into account the fact that we might want to view only a selected list of messages. For example:

if ($query->param("to_view"))
{
   @message_indices = $query->param("to_view");
}
else
{
   @message_indices = (1 .. $num_messages);
}

An HTML form element can be set multiple times, meaning that the element "to_view" might contain zero, one or more elements. All of those are put inside of @message_indices unless to_view was not set, in which case all messages are displayed by default.

How can we get a list of current messages? A program called mail-index.pl (see Listing 3 in the archive file) should do the trick. This program can be invoked from the same sort of form we have seen already; simply modify the “action” to point to mail-index.pl, rather than better-print-mail.pl. As with print-mail.pl and better-print-mail.pl, mail-index.pl must receive the user name, password and name of the mail server in order to function. With that information in hand, it logs into the POP server and displays the message headers for mail waiting to be read.

Each message is presented with a check box. By checking the box next to a message, the user indicates he would like to read that particular message. When the user clicks on the “submit” button, better-print-mail.pl is sent not only the user name, password and mail server, but also the list of checked messages. As we have seen, better-print-mail.pl already knows how to handle this list and prints only requested mail messages.

Conclusion

Setting up a web-based mail system is not all that difficult. I would hesitate before adding a delete function, since I would worry about deleting my only copy of a message. (My e-mail program makes automatic backups, so I never have to worry about that on my own computer.) However, adding such functionality would be quite easy, technically speaking.

Next month, I will show you how to build a system that allows you to send mail as well as read it. We will build on the software we examined this month, adding some functionality to it and tying it into our own mail-sending CGI programs. With a bit of software, you too can begin to compete with Hotmail!

Reuven M. Lerner is an Internet and Web consultant living in Haifa, Israel, who has been using the web since early 1993. His book Core Perl will be published by Prentice-Hall in the spring. Reuven can be reached at reuven@lerner.co.il. The ATF home page, including archives and discussion forums, is at http://www.lerner.co.il/atf/.