Don't be afraid of large joins—learn to generate complex SQL queries from easy-to-use interfaces.
Over the last few years increasing numbers of web sites have begun to integrate a relational database management system (RDBMS). Databases excel at storing and retrieving information quickly and easily and make it possible for web developers to create sophisticated applications without getting bogged down in the details.
While it is not hard to integrate a database into a web site, it can sometimes be tricky to design the database tables, as well as construct sophisticated applications to use them.
Last month we looked at (normalization) the technique that makes it possible to reduce potential errors while increasing the speed and flexibility of our database. This makes it easier to create a variety of applications with the same data. This comes at a price: the queries needed to retrieve information from the database become more complex, often joining data from two, three or four tables. With a bit of training, however, programmers can easily overcome their fear of large joins, using them to retrieve only the information they need.
Most users are not programmers, and it is unrealistic to expect them to construct complex queries on their own. The trick is to create a user interface that makes it possible to answer a large subset of the possible questions, without overwhelming users with a variety of options.
This month, we will spend some time looking at how to generate complex SQL queries from relatively easy-to-use interfaces. Our examples will draw on the example tables from last month, which describe the Israeli train system.
In the end, we will write two types of programs. Some programs (often known as “query generators” in the database trade) will generate HTML forms that can be used to create queries. Other programs will translate the HTML form into an actual SQL query, displaying the results in the user's browser.
Linux users often sneer at the Macintosh interface, because it limits the available options. At the bash prompt, a Linux user can invoke many thousands of commands. Furthermore, the output of any program can be piped to a file or another program. This multitude of options makes Linux a particularly powerful system, yet one that is difficult to learn and master.
Most users are not interested in power, but simply want to get their work done. Giving users too much flexibility can sometimes be a hindrance. In such cases, we want to limit users' options, forcing them to make a choice that our programs can handle. This is one of the key advantages of a GUI, it reduces the chance for user error by reducing the number of available options. It also reduces the number of potential inputs a program might receive, making it easier to test applications.
For example, consider an HTML form that asks users to enter their destination train station. Many HTML forms use a text area to input such information as follows:
<input type="text" name="destination" size="30">
Allowing users to enter a station name in this way opens Pandora's box, forcing the program to handle misspellings, capitalization issues and even problems stemming from whitespace. Databases might be good at many things, but they do require precise inputs that match exactly.
We can improve the situation and simplify our program by limiting the user's actions with a <select> list. Each <option> in this <select> list will correspond to one row from the RailStations table, with the value attribute set to the row's “ID” and the user-visible text set to the name attribute. For example, we could do the following:
<select name="destination"> <option value="1">Nahariya <option value="2">Akko <option value="3">Hof Hacarmel <option value="4">Tel Aviv Central <option value="5">Tel Aviv Hashalom <option value="6">Lod <option value="7">Rehovot <option value="8">Herzliya </select>
The above <select> list provides more reliable input than a text field. However, it presents several problems. First and foremost, placing the above static HTML in a file means the RailStations table and the <select> list will inevitably go out of synch. From the user's perspective, the above <select> list is difficult to use, because it orders the items according to their ID numbers rather than their station names.
We can solve both of these problems with our query generator, a CGI program that produces HTML forms based on information in the database. Listing 1 (see Resources) contains a simple non-CGI program, select-list-from-table.pl, that produces the above <select> list based on the current contents of the RailStations table with the stations listed in alphabetical order. If a train station's name is changed or a new station is added, an HTML form created by select-list-from-table.pl will immediately reflect the new value.
Because rows returned by a SELECT statement can be returned in any order specified by the ORDER BY clause, it is possible to produce a <select> list in an order other than alphabetical. For example, users might prefer to see a list of train stations by location.
The <select> list produced by Listing 1 has at least one problem. What if the user doesn't care about the particular piece of information in the <select> list? For example, assume the user is interested in traveling from Tel Aviv to Rehovot at some point during the day, but has not yet decided when. Forcing a user to choose a time from a <select> list makes things more difficult than they should be. It would be nice to have an “any” option on each <select> list, allowing users to indicate this particular field can have any information.
Implementing this strategy requires two things. First, one <option> value can be guaranteed not to correspond to an existing row's primary key. Second, the CGI program creating the final query will identify this value and modify the SQL accordingly.
Luckily, the combination of MySQL and Perl works quite well in this context. MySQL's auto-incrementing primary keys begin with 1 and increase until the maximum value is reached. Because an auto-incrementing primary key can never be zero, we can create an additional <option> line in the <select> list:
<option value="0" selected>Any
By making this value 0, we ensure it cannot correspond to any actual row's primary key. And by marking it “selected”, we set this to be the default value for the <select> menu. New visitors to the site who accept the default values will thus get the widest possible search, with the fewest possible WHERE clauses. Each selection of an <option> with a non-zero value will add a new WHERE clause to the resulting SQL.
The following query lists all times at which trains from Nahariya (ID 1) have an endpoint of Tel Aviv (ID 5):
SELECT S.name, DT.departure_time FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL WHERE T.id = DT.train_id AND T.line_id = SL.line_id AND S.id = 1 AND SL.station_id = DT.station_id AND DT.station_id = S.id AND T.destination_id = 5 ORDER BY DT.departure_time ;
We can turn the above query into a CGI program by replacing the two ID numbers with placeholder values. If we fill in the placeholders with the contents of the “origin” and “endpoint” HTML form elements, we can find the times of all trains from one station heading toward a particular endpoint. Such a CGI program, list-trains-to-endpoint.pl, is in Listing 2 (see Resources).
The above code works just fine until someone selects one of the Any items with a value of 0. If that happens, MySQL will not return any rows from the SELECT, because no stations have an ID of 0. The solution is to make those parts of the query conditional, inserting them only if an actual value was indicated.
We accomplish this by placing two “if” statements in the middle of the code that assembles the SQL statement. Because the generic ID is 0, our program can test for a set value simply by putting the variable name inside parentheses, which implies a test for non-zero values:
my $sql = "SELECT S.name, DT.departure_time "; $sql .= "FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL "; $sql .= " WHERE T.id = DT.train_id "; $sql .= " AND T.line_id = SL.line_id "; if ($origin) { $sql .= " AND S.id = ? "; push @placeholders, $origin; } $sql .= " AND SL.station_id = DT.station_id "; $sql .= " AND DT.station_id = S.id "; if ($endpoint) { $sql .= " AND T.destination_id = ? "; push @placeholders, $endpoint; } $sql .= " ORDER BY DT.departure_time ";
Listing 3 (see Resources) contains a program, better-list-trains-to-endpoint.pl, that allows users to specify a station of origin, an endpoint for the train, neither or both. If the user specifies only the station of origin, the program will display a list of trains departing from that station without regard for direction. If the user specifies only the endpoint, it will list all departures toward that station. Most applications do not need to give such headway to users, and might want to trap inputs in which both elements are assigned values of 0. At the same time, no harm comes from allowing users to amuse themselves with nonsensical queries.
The above queries work just fine, yet they ignore a crucial issue when working with train schedules: people typically want to specify the time they prefer to leave or arrive. It would certainly be possible to give users a set of <select> lists corresponding to various hours and minutes in the day, giving them fine control over the assembled query. We could also allow them to enter dates and times in text fields, but as with station names, there are too many possible ways for such input to go wrong.
It is probably easier for them to relate to time information such as “morning”, “afternoon” and “night”, rather than specifying hours. We can do this by using another set of <select> lists, this time specifying hours as values and by setting the hours to the end of the specified period. For example:
<select name="time"> <option value="12:00">Morning <option value="17:00">Afternoon <option value="21:00">Evening <option value="24:00">Night </select>
We can find all morning trains by asking for those earlier than 12:00, afternoon trains by asking for those before 17:00 and so forth.
With such a <select> list in place in the query generator form, we can rest assured that users will be able to find their train. If we are concerned that too many trains will match their query, we can add another <select> list to the HTML form, allowing users to limit the number of returned trains. MySQL supports a LIMIT clause on SELECT queries, making it possible to return a maximum number of rows.
Note that LIMIT will always contain a numeric value and can thus be inserted directly into the SQL query without the use of placeholders. Indeed, trying to use placeholders in a LIMIT clause will force the number to be quoted, which will cause a MySQL error.
best-list-trains.pl, a program that implements both of these ideas—the “time” element and a “limit” element—is in Listing 4 (see Resources).
While it is tempting to give users infinite freedom to enter information into a web site, it is usually best to limit their inputs as much as possible. Creating simple HTML-based query generators is not difficult and can even be easy once you get the hang of it. The trick is to formulate queries in such a way that the user can get the maximum information while knowing as little as possible about the underlying database. Even when the queries are easy to create, finding ways to turn those queries into language suitable for non-programmers can be a challenge.