At the Forge

Checking Your HTML

Reuven M. Lerner

Issue #182, June 2009

Integrate HTML validation into your test suite for better HTML from the get-go.

We say that Tim Berners-Lee invented the World Wide Web, and that's certainly true. But, we can boil the Web down to three specific technologies: URLs (for uniquely identifying resources on the Internet), HTTP (a stateless protocol for transmitting documents) and HTML (a markup language). Each of these inventions was simple to understand, as well as simple to implement. And, it is this combination of simplicity and elegance that has made the Web the success that it is.

All three of these technologies have evolved over the years, reflecting new uses and needs. For example, HTTP now supports a system of “headers” in both the request and response, which can do everything from indicate the content type of the response body to provide hints regarding how long the data should be cached.

HTML has grown up quite a bit as well, evolving to become a truly semantic markup language (with styling information moved to external CSS documents) with a more rigorous and standardized definition. Standardization has made HTML slightly harder to write, in that you need to be more careful about items, such as tag names (keeping them lowercase), attributes (because not all are valid in all contexts) and closing tags. One advantage to such standardization is that we now can predict to a much greater degree what pages will look like across different browsers. Sloppy HTML means that the browser has to decide what you meant, which can have consequences that vary widely in their influence on the way the page looks.

More significant, the rise of AJAX as a paradigm for Web development has made it increasingly important that HTML be well formed. Many AJAX-related routines need to modify a particular element on the page in some way. The easiest way to do this typically is to grab the element via its id attribute, which is guaranteed to be unique. (If you want more than one element to use an ID, you really should use a class instead.) In the last few months, I have worked on a number of pages that had duplicate ID attributes. Sometimes this was the result of a simple mistake, and sometimes it resulted from ignorance on the part of a Web designer. But in all cases, this meant that my JavaScript performed differently from what I expected.

Although HTML validation might seem boring, it's actually an essential part of getting AJAX-powered, latest-paradigm, super-fancy Web sites to work. This month, I review a few tools I use to make sure the HTML I create is as standards-compliant as usual. I begin with some simple, manual tests that can run on individual pages. Then, I show some automated tools I use when developing applications in Ruby on Rails, allowing me to check the HTML of all of my pages en masse, including those that require password protection to access.

HTML Standardization

Before continuing, it's important to realize that HTML is a catchall term for many different, related markup languages. And, when I say markup, I mean that HTML is a language used to describe a text, identifying its different parts. For instance, a newspaper article will have a headline, one or more authors, one or more paragraphs of text, zero or more photographs, and one or more captions per photograph. A markup language doesn't add content to a document, but rather describes the individual parts of the document, so that they can be laid out and displayed in an appropriate way. In this sense, HTML is a direct descendant of SGML, a markup language that was developed many years previously, but which was far more difficult to work with.

Although there have been several versions of HTML over the years, let's focus on the ones that are most widely used today. Perhaps the most common version of HTML is an unstructured, unversioned, nonstandard document. I'm certainly guilty of creating many such documents, which look like this:


<html>
<head>
    <title>This is the title</title>
</head>
<body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
</body>
</html>

Nothing is wrong, per se, with the above document. But, because it fails to indicate which version of HTML it is using, browsers must make a variety of assumptions. These assumptions can make it hard to predict how different browsers will operate, using something known as quirks mode.

Fortunately, we can choose a standard implementation and indicate that to a browser by adding a DOCTYPE declaration at the top of the document. When assigning the value of DOCTYPE, you need to decide whether you will use HTML or XHTML (that is, an XML-compliant version of HTML), and whether you want the strict, transitional or frameset variety of that markup language.

The strict version of each markup language is the ideal version that allows no styling elements. On a modern site, such styling should be defined in CSS, not in HTML. However, it may be difficult for some sites to comply with the strict definition, either because their authoring tools use tags that aren't allowed in the strict definition, or because the site's authors want to use forbidden elements, such as those for embedded Flash. To make the transition to strict HTML easier, the standards allow for transitional HTML, which provides a larger number of tags.

Let's define our tiny document as follows:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
    <title>This is the title</title>
</head>
<body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
</body>
</html>

The <!DOCTYPE> declaration at the top of the page tells browsers (and any other programs that might try to parse the page) that we want to follow the standards, but that we'll do so using the transitional declaration.

Once we have indicated our willingness to apply the transitional standard, we may discover that our documents are no longer valid. For example, if I include an image in my HTML document:


<img src="/images/foo.jpeg">

With the above line inserted into my document, it is no longer valid, because it is missing an alt attribute. Once I add that attribute, the document is valid:


<img src="/images/foo.jpeg" alt="foo">

However, we can get even better results if we enforce XML considerations and declare our document to be XHTML transitional. To do that, we modify not only the !DOCTYPE declaration, but also the <html> tag:


<!DOCTYPE html
          PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>This is the title</title>
  </head>
  <body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
    <img src="/images/foo.jpeg" alt="foo">
  </body>
</html>

Suddenly, our document is invalid again. Because we have declared it to be XHTML transitional, we need to follow XML rules. We need to close our <img> tag, most easily accomplished by using the self-closing syntax:


<!DOCTYPE html
          PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>This is the title</title>
  </head>
  <body>
    <h1>This is the headline</h1>
    <p>This is a paragraph</p>
    <p>This is another paragraph</p>
    <img src="/images/foo.jpeg" alt="foo" />
  </body>
</html>

With that in place, our document is now valid. As you can imagine, finding all the problems that might occur in a document can be difficult, even for someone who is trained and experienced. Trying to check all the pages on a site, particularly one that contains hundreds or thousands of pages, clearly would be impossible.

The solution, then, is to have a program check the pages' validity automatically, preferably as part of your automated tests. This way, you can discover when you have problems quickly and easily.

W3C Validator

One of the best tools for checking the validity of a page's markup is the World Wide Web Consortium's validator, available at validator.w3.org. I use the validator almost exclusively from within Firefox, into which I have installed the Web Developer plugin. This plugin lets you validate the HTML of any page, simply by selecting Validate HTML from the browser. The browser submits the page's URL to the W3C validator, which then gives a line-by-line indication of what problems (if any) the page contains.

The W3C validator has at least two problems, however. First, it requires that you submit each page, one at a time, to the validator program. This means a great deal of time and effort, just to check your pages. A second consideration is more practical; the validator works only with pages that are accessible via the Internet, without password protection. If your site is being developed on your local computer, and if you have a firewall protecting your business from the outside world, you probably will be unable to use the validator via the Web.

One solution to this problem is to install the W3C validator on your local computer. You can get the source code from validator.w3.org/source, which comes in the form of a Perl program. On modern Debian and Ubuntu machines, you can install w3c-markup-validator, which makes it available via your local Web server, ready to be invoked.

If you end up installing the validator manually, it requires a number of modules, which you might need to download from CPAN (Comprehensive Perl Archive Network), a large number of mirrors containing open-source Perl modules. It might take some trial and error to figure out which modules are necessary, although if you are an experienced user of the CPAN.pm installer, this shouldn't be too much trouble. Note that the SGML::Parser::OpenSP module requires the OpenSP parser, which you can get from SourceForge at openjade.sf.net.

As you might be able to tell, a number of these modules are required in order to handle alternate encoding schemes, particularly those for Asian languages. Even if you aren't planning to handle such languages, the modules are mandatory and must be installed.

The validator program, called check, should be put in a directory for CGI programs or in a directory handled by mod_perl, the Apache plugin that lets you run Perl programs at a higher speed, among other things. You also will need to install a configuration file, typically placed in the directory /etc/w3c, but which you can relocate by setting the W3C_VALIDATOR_CFG environment variable.

Validating Rails Templates

Now that you have the W3C checker installed on your own server, you can feed it URLs that aren't open to the public. But, if you are developing an application in Ruby on Rails, you can go one step better than this, integrating the W3C validator into your automated testing.

In order to do this, you need to install the html_test plugin for Rails. Go into your Rails application's root directory, and type:

script/plugin install 
 ↪http://htmltest.googlecode.com/svn/trunk/html_test

With this plugin in place, you now can use three new assertions in your functional and integration tests: assert_w3c returns true if the W3C validator approves of your HTML; assert_tidy returns true if you're using the HTML Tidy library, described below; and, assert_validates calls both of these.

So, if you have a FAQ page you want to check with an integration test, you can write something like this:

def test_faq
 get '/faq'
 assert_response :success
 assert_w3c
end

If the HTML for this page is approved by the W3C validator, everything is fine. If this page is not valid, you will get quite a bit of output, which you should redirect to a file. This file will contain not only the results of your tests, but also the same HTML output that you would have gotten from the public, Web-based W3C validator. This means you'll get a complete and easy-to-read description of what you did wrong.

You'll often discover that a large number of validation errors can be fixed with a small number of corrections. For example, when I ran this test against a sloppy FAQ page, I got six validation errors. I was able to fix all of them by indicating the appropriate namespace in my <html> tag and removing an extraneous </p> from the end of the file.

Checking HTML validity in this way is nice and easy. (It can be time consuming, however, to invoke the validator on every single page; I think the trade-off is worthwhile, but you might disagree.) If you always want to check HTML validity, you can change your test environment's configuration somewhat, so that it'll happen automatically, without having to invoke assert_w3c each time.

To do this, you need to modify test_helper.rb, which sits at the top of the test directory, and which is included into every test program. All you have to do is add:

ApplicationController.validate_all = true
ApplicationController.validators = [:w3c]

You also can check the validity of URLs and redirects; although these aren't checking HTML validity per se, they do come with the html_test plugin and are quite useful:

ApplicationController.check_urls = true
ApplicationController.check_redirects = true

With these four lines in your test_helper.rb, you can run your integration tests once again. If any of the validation tests fail, you can look at /tmp/w3c_last_response.html, which will contain the complete output of that failure. This doesn't help very much if you have multiple failures, however.

If you have designed your templates using the DRY (don't repeat yourself) principle, fixing HTML markup problems shouldn't be too bad. In many cases, you will need to change only one tag in the layout to fix everything.

HTML Tidy

The W3C validator is excellent, but it doesn't always catch everything, such as empty tags. For this, you might want to add to your arsenal, integrating the open-source Tidy library, which identifies and fixes badly written HTML. Tidy originally was written by Dave Raggett, one of the best-known developers from the early days of the Web; the project is now on SourceForge at tidy.sf.net.

To integrate Tidy checking into your Rails application, first install the library from SourceForge. Then, install the Ruby gem for Tidy integration:

sudo gem install tidy

Finally, download and install the Rails Tidy plugin:

cd vendor/plugins
wget
http://www.cosinux.org/~dam/projects/rails-tidy/rails_tidy-0.3.tar.bz2
tar -jxvf rails_tidy-0.3.tar.bz2

Now, modify test_helper to read:

ApplicationController.validators = [:w3c, :tidy]

With that in place, every request to your server now will be checked by both validators, rather than just one.

The Rails Tidy plugin can be useful beyond checking and validating to fix your HTML as it is sent from your server to the user's browser. Although I like this idea in theory, it seems fairly inefficient and slow to parse and rewrite every bit of HTML as it is sent. Plus, I feel that debugging Web applications (and CSS) is tough enough without having the HTML magically rewritten behind the scenes.

Conclusion

HTML has evolved quite a bit over the years, and getting your pages to contain valid HTML can be difficult to handle manually. For this reason, using automated checks and integrating those checks into a Web application's automated settings is a good way to ensure that your site is adhering to HTML standards as closely as possible. This not only gives you the greatest chance of having the site render similarly on different platforms, but it also even may boost your ranking in Google (an assertion I have seen mentioned in several places, but for which I obviously have no proof).

If you are using Ruby on Rails, you can validate your HTML easily from the start of your project. By doing so, you will make life easier for yourself down the line. Moreover, this is far easier than checking pages manually, and it ensures that even administrative and other hidden pages are validated.

Reuven M. Lerner, a longtime Web/database developer and consultant, is a PhD candidate in learning sciences at Northwestern University, studying on-line learning communities. He recently returned (with his wife and three children) to their home in Modi'in, Israel, after four years in the Chicago area.