Structuring XML Documents

Terry Dawson

Issue #55, November 1998

XML has nearly all of the power and features of SGML, but will probably be much better supported because of the web-driven market for browsers and editors.

  • Author: David Megginson

  • Publisher: Prentice Hall

  • E-mail: info@prenhall.com

  • URL: http://www.prenhall.com/

  • Price: $44.95 US

  • ISBN: 0-13-642299-3

  • Reviewer: Terry Dawson

Take a close look at any of the various documentation projects operating within the Linux community, and you will find SGML. The Linux Documentation Project, the Debian Documentation Project and others are using SGML as the primary tool in producing consistently structured and styled documentation. The search for a more sophisticated replacement for HTML has led to the development of XML, which is based heavily on SGML. XML has nearly all of the power and features of SGML, but will probably be much better supported because of the web-driven market for browsers and editors. For this reason, XML will probably replace SGML in many applications.

SGML and XML both provide a means of describing the structure of a document. SGML and XML rely on definitions called DTDs, Document Type Definitions, that describe document structure.

In this book, David Megginson competently explains the process of good quality DTD design. While the title suggests it is XML-specific, it is not. SGML and XML have so many similarities that it is possible to describe both simultaneously, highlighting differences between the two where they arise. This book is another in the Charles F. Goldfarb series, and in it, Mr. Megginson describes document structuring using both SGML and XML, managing to avoid confusing the reader during the process. The book has four main parts, and includes a CD-ROM with software that implements XML parsers, and a selection of modern and popular DTDs.

Part One provides some background on XML, describes how it is different from SGML and examines five popular and useful DTDs. This chapter isn't for people with no prior SGML or XML experience and isn't designed to teach you either, but if you are familiar with at least one of them, it will assist you in learning about the other. The chapter on DTD syntax clearly illustrates the differences and similarities between the two. DTDs examined in detail are:

  • ISO-12083

  • DocBook

  • Text-Encoding Initiative (TEI)

  • MIL-STD-38784 (CALS)

  • HyperText Markup Language (HTML 4.0)

The first four of these are in common use and have inspired many other DTD designs. The CALS table design, for example, has been borrowed many times and used in other DTDs.

Part Two covers the principles of DTD analysis. The core material of the book begins in these chapters. They describe how to critically analyse a DTD from three important perspectives: ease of learning, ease of use and ease of processing. The ease with which a particular DTD can be learned is critically important in having a DTD accepted by authors. If the DTD is difficult to learn, authors will tend not to use it, use only a small subset of it, or worse, misuse it by bending it to suit their needs. Mr. Megginson describes how to analyse the ease of learning of a DTD with the aim of instructing you how to design easy-to-learn DTDs.

The chapter entitled “Ease of Use” describes how to analyse a DTD to determine if it will be easy for authors to use when they are writing their documentation. Some of the issues explored are the naming of tags and attributes, when to use a new tag and when to add an attribute to an existing tag, and structural issues that can simplify or complicate an author's job.

The chapter on ease of processing is of particular interest to those who publish and develop processing tools. A DTD may be easy for the author to learn and use, but this doesn't always translate into something that is easy to process into printed or published form. The lessons are mostly common sense applied to the specific task of DTD design.

The third part of the book covers a number of advanced DTD maintenance and design issues. It will be of interest mostly to people who intend to use SGML or XML for purposes other than publishing, such as database systems or other information management applications. The first topic covered is that of DTD compatibility. I mentioned earlier that the CALS table design had been borrowed for use in other DTDs. When DTDs are similar, it is fairly simple to translate a document from one DTD to another. This is very useful if you wish to exchange documentation with a group which has a different DTD. This chapter describes how to identify compatibility and the advantages of keeping compatibility in mind when designing a DTD.

The second topic extends this discussion to exchanging document fragments. A document fragment might be a single chapter or paragraph from a book. If you wish to share portions of a document with a group using a different DTD, you will find useful tips in this chapter on ways to simplify the task.

The final topic in this section is DTD customisation. DTD customisation is the process of taking an existing DTD and modifying it to suit your specific purposes. Designing a sophisticated DTD can be a complex task. Often, there is little reason to design a DTD from scratch; an existing DTD may provide 95% of what you need, requiring only a small amount of customisation to fully meet your needs. This can save a lot of time and provides advantages in terms of document exchange and compatibility. This chapter describes how to customise DTDs, and how to design DTDs that are easy to customise. The DocBook DTD, for example, was designed with hooks in place that allow for easy customisation.

The fourth and final part of the book covers DTD design using a technique called Architectural Forms. Architectural Forms allow DTD designers to specify the method by which their DTD should be translated into one or more other DTDs. Architectural Forms allow you to write documents which are simultaneously valid for a number of different DTDs. This section of the book describes the concepts and the implementation of Architectural Forms and offers useful hints and advice to designers wishing to use this facility. I found this part of the book a little difficult to comprehend, but that was almost certainly due to my limited exposure to applications requiring use of this advanced technique. I'm confident that anyone with an application for Architectural Forms will find the information presented to be a good introduction to the topic.

I am pleased to report that the CD-ROM includes Linux versions of the XML parsing software. Two XML parsers are provided. The first is a precompiled version of the popular “SP” parser. The second is a Java-based XML parser called “Aelfred”. Each DTD described in the book is included in its SGML form, as well as a number of links to useful resources on the Internet. The CD-ROM provides some HTML-based documentation, but is otherwise not well documented. I am left with the impression that the CD-ROM was a last-minute addition to the book; nevertheless, it does provide tools to allow the reader to experiment with the techniques described, and to that end it is adequate.

I found Structuring XML Documents to be an interesting and informative book that I will certainly be using as a reference in the future. David Megginson has done a nice job of concisely capturing a lot of material while keeping the pace slow enough to allow one to absorb the information fairly comfortably. The book is ideal for both SGML or XML designers, and SGML designers should not be misled by the title. I recommend that anyone with an interest in DTD design, especially those involved with Linux-related documentation projects, take a look at this book. It is certain to be of assistance in your efforts.

Terry Dawson is a full-time Linux advocate who is reveling in his new role of “Dad” with the birth of his son Jack. Mr. Dawson can be reached at terry@perf.no.itg.telecom.com.au.