Cooking with Linux

Words, Words, Words...

Marcel Gagné

Issue #154, February 2007

When it comes to being understood and sharing information, it's not just about open source, it's about open standards.

What are you doing, François? Our guests will be here any moment, and you are still sitting in front of your computer. Quoi? Yes, I agree with you that it's a good idea to store all these old documents in OpenDocument format. I admire your desire to ensure the long-term usability of these documents by converting them from their limited, proprietary format, but this is not the way. We have thousands of documents from hundreds of people on this storage area network. Converting them one at a time as you are doing will take forever, and we are minutes away from opening time. Besides, I have a much better way to deal with this and you'll see it on tonight's menu.

Vite! To the wine cellar. I see our guests coming to the door now. There are six cases of 2002 Paso Robles Zinfandel over in the East wing, right next to the old door marked Danger—I should really have you check that out sometime so we can find out what is back there—bring the wine and I will greet our guests. Vite!

Welcome, mes amis, to Chez Marcel, where fine Linux and open-source fare is married with some of the world's best wines. Your tables are ready and waiting, so please, sit down and make yourselves comfortable. My faithful waiter, François, will return shortly from the cellar with your wine for tonight. Before you arrived, we were discussing a little project to convert all of the old proprietary format .doc documents to the OpenDocument format, OpenOffice.org's default document format. This is the OASIS OpenDocument XML (eXtensible Markup Language) format, an open standard for document formats (it is saved with an .odt extension). The OpenDocument format is the closest thing to document freedom you will get (short of plain text). The format is vendor- and application-neutral. You are guaranteed support and portability because it is an open standard. Many organizations, such as the European Commission and the state of Massachusetts, are starting to recommend the OASIS OpenDocument format for the very reasons I've mentioned.

Ah, François, good to see you made it back with the wine. Please, pour for our guests. Enjoy, mes amis. You'll find this particular wine rich and jammy, with wonderful black raspberry flavors, a little licorice, a little pepper....

Ah, where was I? Oh, yes—converting to OpenDocument makes sense, but some people, of course, will stay with the Word format, not so much for technical reasons, as for simple inertia. After all, Microsoft Word is everywhere. The sheer number of Word installations is the very reason that OpenOffice.org was designed to support the Microsoft Office format as thoroughly as it does. That said, if you do want to switch to the OASIS OpenDocument format, OpenOffice.org Writer provides an easy way to do that. Rather than converting documents one by one, the Document Converter speeds up the process by allowing you to run all the documents in a specific directory in one pass. It also works in both directions, meaning you can convert from Word to OpenOffice.org format and vice versa. The conversion creates a new file but leaves the original as it is. Here's how you do it.

From the menu bar, select File, move your mouse to Wizards, then select Document Converter from the submenu. To convert your Microsoft Office documents, click the Microsoft Office radio button, then check off the types of documents you want (Figure 1). You can do Excel and PowerPoint documents at the same time.

Figure 1. Start by selecting the types of documents you want to convert.

The next screen asks whether you want both documents and templates or just one or the other. You then type in the name of the directory you want to import from and save to. This can be the same directory or you can choose an alternate. You need to answer this set of questions three times if you've chosen to do Excel and PowerPoint files at the same time, but the dialog is the same as the one you'll see for Word documents (Figure 2).

Figure 2. If you choose to convert Excel and PowerPoint documents as well, you'll get a similar dialog for each one.

After you've entered your information and gone to the next screen, the program confirms your choices and gives you a final chance to change your mind. Click Convert to continue. As the converter does its job, it lists the various files that it encounters and keeps track of the process. Click the Show Log button to see a listing of everything the converter encountered (Figure 3). When the job is done, you'll have a number of files with .odt extensions in your directory. Spreadsheets will have .ods extensions, and presentations will have .odp extensions. If you change your mind, don't worry. Your original files are still there, so you've lost nothing.

Figure 3. The progress dialog shows the status of the conversion and provides a log of the process.

As you can see, it's easy. And, this wine is easy-drinking, I see. François, some of our guests' glasses are looking a little empty. Please, resolve this issue for them with a little top-up. Merci, mon ami.

If you've never taken a good look at an OpenDocument document, you should. It's quite fascinating, actually. What you may not know is that the .odt file is actually a compressed file containing all the elements that make up your document. To be exact, it's a ZIP file. Let's say you had a document titled mydocument.odt that contained several images in addition to the text itself. To view the elements, type the following in a shell or terminal window (to extract them as well, leave out the -l option, and you may want to do that in a temporary folder somewhere):

unzip -l mydocument.odt

The result would look like this.

Archive:  mydocument.odt
  Length     Date   Time    Name
 --------    ----   ----    ----
       39  10-13-06 20:09   mimetype
        0  10-13-06 20:09   Configurations2/statusbar/
        0  10-13-06 20:09   Configurations2/accelerator/current.xml
        0  10-13-06 20:09   Configurations2/floater/
        0  10-13-06 20:09   Configurations2/popupmenu/
        0  10-13-06 20:09   Configurations2/progressbar/
        0  10-13-06 20:09   Configurations2/menubar/
        0  10-13-06 20:09   Configurations2/toolbar/
        0  10-13-06 20:09   Configurations2/images/Bitmaps/
    24634  10-13-06 20:09   Pictures/1235696243C.png
    14808  10-13-06 20:09   Pictures/10C3F082746.png
    68331  10-13-06 20:09   Pictures/20963618D3B.png
     1925  10-13-06 20:09   Pictures/19C4B78A82D.png
     9677  10-13-06 20:09   Pictures/112FEC43498.png
     6100  10-13-06 20:09   Pictures/1005A594DCB.png
   172170  10-13-06 20:09   Pictures/3009CCB23C4.png
       54  10-13-06 20:09   layout-cache
    23674  10-13-06 20:09   content.xml
     7950  10-13-06 20:09   styles.xml
     1211  10-13-06 20:09   meta.xml
     4899  10-13-06 20:09   Thumbnails/thumbnail.png
     7386  10-13-06 20:09   settings.xml
     2904  10-13-06 20:09   META-INF/manifest.xml
 --------                   -------
   345762                   23 files

This collection of XML definitions, images and so on, makes the document portable and readable by other programs.
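Because the package is an ordinary ZIP archive, you can also read a single element without unpacking anything to disk. What follows is a minimal sketch, assuming the unzip program is installed; the function name odt_peek is made up for this illustration. It prints the mimetype entry (for a text document this is the 39-byte string application/vnd.oasis.opendocument.text, which matches the length in the listing above) followed by the metadata in meta.xml.

```shell
#!/bin/sh
# Hypothetical helper: show what kind of OpenDocument file this is and
# print its metadata, without extracting the archive to disk.
# "unzip -p" writes the named archive member to standard output.
odt_peek() {
    odt_file=$1
    unzip -p "$odt_file" mimetype   # declared MIME type, no trailing newline
    echo                            # finish the line ourselves
    unzip -p "$odt_file" meta.xml   # title, author, dates and so on
}

# Example call (mydocument.odt is the file from the listing above):
#   odt_peek mydocument.odt | less
```

The same approach works for content.xml or styles.xml if you are curious about the text and formatting themselves.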

Of course, mes amis, even if you have all these old documents and you want to preserve them in some kind of open format that doesn't require a copy of Microsoft Office, you may not need them in editable format. A simple, read-only format, such as PDF, may be the answer. OpenOffice.org has a built-in export to PDF, but unlike the document converter, this is a one-at-a-time affair. As François can tell you, one at a time can take a long time.

If you have OpenOffice.org on your system, I have just the thing for you. It's an OpenOffice.org macro document called, wait for it, Document Converter, or just DocConverter. This macro, written by Danny Brewer and Don Horwood, is designed to let you easily do batch conversions from any document format OpenOffice.org supports to any other format it supports. In other words, the output doesn't have to be PDF, as there are a number of alternatives you can choose. You can find the Document Converter at the OpenOffice.org Macros Web site (see the on-line Resources). Macros are sorted into end-user applications and those suited to developers. Click the For End-Users link at the top of the page, and scroll down to find Document Converter.

To use the macro, unzip the file and save the document somewhere. When you open it with OpenOffice.org Writer, a warning dialog appears, asking you if you want to enable the macros in the document. The correct answer, in this case, is yes. The document that appears is exactly that, a document. At the top left of the document is a big button labeled Document Converter (Figure 4). Click that button, then simply follow the wizard that pops up. Tell it which folder your Word files are in and what folder you would like the PDF files to appear in. It's a simple point-and-click task.

Figure 4. To make the document converter do its thing, simply click the big button and follow the instructions in the wizard.

Don't forget to check out some of the other great macros that exist on the site. You might discover something you can't live without.

All this converting of documents using OpenOffice.org is pretty cool, but it also tends to make us forget the great and powerful conversion tools that lie just beneath the graphical surface of your Linux system. Most distributions come with a variety of document converters waiting for the command-line user to use them. For instance, you may have PostScript documents that you want to convert to PDF, so you can send them to friends or family who don't understand PostScript. The command-line program, ps2pdf, comes in extremely handy under those circumstances:

ps2pdf mydocument.ps mydocument.pdf

The ps2pdf program produces a document compatible with Acrobat Reader version 3, which corresponds to PDF version 1.2. To create version 1.3 PDF output (for Acrobat Reader 4 or later), use ps2pdf13. There's also a ps2pdf14 program. I'll leave it to you to guess which version of PDF it outputs. You also can convert PDF documents to PostScript by using pdf2ps and PostScript documents to plain ASCII text with pstotext. You'll also find a program called ps2ascii, which does more or less the same thing, but it doesn't handle encoded text (such as French accents) as well.
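Since the point of tonight's menu is to avoid converting documents one at a time, here is a small sketch of how ps2pdf might be wrapped in a loop to handle a whole directory. The function name ps_dir_to_pdf and the PS2PDF variable are inventions for this example, not part of Ghostscript; the variable simply lets you swap in ps2pdf13 or ps2pdf14.

```shell
#!/bin/sh
# Hypothetical helper: convert every .ps file in a directory to PDF.
# The originals are left untouched; each PDF lands beside its source.
# Set PS2PDF to choose another converter (for example ps2pdf14).
ps_dir_to_pdf() {
    dir=${1:-.}                       # default to the current directory
    for ps in "$dir"/*.ps; do
        [ -e "$ps" ] || continue      # no .ps files: skip the literal pattern
        "${PS2PDF:-ps2pdf}" "$ps" "${ps%.ps}.pdf"
    done
}

# Example call (the directory name is only an illustration):
#   PS2PDF=ps2pdf14 ps_dir_to_pdf "$HOME/postscript"
```

Quoting the filenames keeps the loop safe for documents with spaces in their names.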

Hey, how about a nice, plain-text document from that Web site, minus all the HTML tags? That's the idea behind the html2text program. To define the output file, you need to specify it using the -o option:

html2text -o outputfile.txt http://somedomain.dom/document.html

If you are curious to see what sorts of conversions you can do, change directory to /usr/bin and look for the commands that include a 2 or a to. Not everything you see will be a document converter, of course, but you'll discover some interesting commands that are.
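If you would rather let the shell do the looking, something like the following sketch can narrow the list. The function name list_converters is made up, and the pattern is my own assumption: it keeps only names where the 2 or the to sits between two other characters, as in ps2pdf or pstotext. You will still see a few commands that have nothing to do with documents.

```shell
#!/bin/sh
# Hypothetical helper: list commands whose names look like "XtoY" or "X2Y".
# Defaults to /usr/bin; pass another directory as the first argument.
list_converters() {
    dir=${1:-/usr/bin}
    # Require at least one letter or digit on each side of "2" or "to".
    ls "$dir" | grep -E '[[:alnum:]](2|to)[[:alnum:]]'
}

# Example call, paged for comfortable reading:
#   list_converters | less
```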

Before I leave the subject of Word document conversion completely, I need to mention Dom Lachowicz's wvWare (which started life as just wv when Caolán McNamara wrote it). The package is available from SourceForge (see Resources), but you should have no trouble finding a package for your particular Linux distribution. For wv, think “Word Viewer”. This package allows you to convert (or view) Microsoft Word documents to a wide variety of formats. wvWare is actually a collection of command-line tools, such as wvText:

wvText SomeWordDocument.doc

The output will go directly to your screen, so you may want to capture it by redirecting to a file or piping it to more (or less). There's also wvPDF to convert to PDF, wvLatex to convert to LaTeX, wvAbw to create AbiWord-compatible documents and more. Check out the site documentation for all the alternatives.

Why use all these text tools when the graphical alternative exists? The answer, mes amis, is speed. Speed and flexibility. Sorry, the two answers are speed and flexibility—okay, I'll stop right there.

All through this discussion, I've concentrated on text, but converting to open standards covers a lot of possibilities, including graphics, video files, music files and more. Tackling these formats is the beginning of another, rather rich, menu, but alas, that insistent clock is telling us that closing time is here. As you can see, there are many opportunities for taking those old, closed-format documents and storing them in formats you will be able to access years from now, free from the will or whim of some mega-corporation's definition (or support) of what it calls a standard. Plain text, mes amis, is still the most portable of all formats. Nevertheless, choosing and using an open document format, such as OpenDocument, allows you to take advantage of the portability of plain text and the richness of graphics and other non-text elements.

François, please refill our guests' glasses one more time. And now, mes amis, raise your glasses and let us all drink to one another's health. A votre santé! Bon appétit!

Resources for this article: /article/9509.

Marcel Gagné is an award-winning writer living in Waterloo, Ontario. He is the author of the all new Moving to Ubuntu Linux, his fifth book from Addison-Wesley. He also makes regular television appearances as Call for Help's Linux guy. Marcel is also a pilot, a past Top-40 disc jockey, writes science fiction and fantasy, and folds a mean Origami T-Rex. He can be reached via e-mail at mggagne@salmar.com. You can discover lots of other things (including great Wine links) from his Web site at