Extract and Parse ODF Files with Python

Kamran Husain

Issue #156, April 2007

Use Python to demystify Open Document Format files.

The Open Document Format (ODF) Alliance is designed for sharing information between different word processing applications. This article highlights the basic structure of ODF files, some internals of the underlying XML files and shows how to use Python to read the contents to perform a simple search for keywords. The code also can be the basis for more-advanced operations. In the spirit of openness, we use open-source software to read the ODF files, which in this case are Python and the OpenOffice.org package.

If you are running a recent version of Linux or OS X, you already should have Python and OpenOffice.org installed on your machine. If you need the latest versions, Python is available for free from www.python.org for both the Windows and Linux platforms. The OpenOffice.org package also is available for free from www.openoffice.org. Installing OpenOffice.org on an XP desktop is relatively painless. Download the packages from their respective sites and run the installer. Once installed, simply run the application from the desktop with a click on the installed icons.

Once you have confirmed that you have the prerequisites, you can create an ODF file. Open up the Writer, type some text in a document and save it. You can read in a file and save it as an .odt file.

A quick look at extensions in the Save dialog reveals a lot. An ODF file can have many extensions, which provide a clue as to the type of information stored in it and the application that stored it. See Table 1.

Table 1. ODF File Types and Their Extensions

Document FormatFile Extension
OpenDocument Text *.odt
OpenDocument Text Template*.ott
OpenDocument Master Document*.odm
HTML Document*.html
HTML Document Template*.oth
OpenDocument Spreadsheet*.ods
OpenDocument Spreadsheet Template*.ots
OpenDocument Drawing*.odg
OpenDocument Drawing Template*.otg
OpenDocument Presentation*.odp
OpenDocument Presentation Template*.otp
OpenDocument Formula*.odf
OpenDocument Database*.odb

So, what's in an ODF file? An ODF file is basically a zipped archive with several XML files. The actual files and directories in a file will vary depending on the type of information and the system on which the document was created.

The first step in picking out the names of the files in an ODF file requires unzipping the file itself. Fortunately, Python has built-in support for dealing with this endeavor with the zipfile module. Type python on the command line to run an interactive shell. Running a shell allows you to examine the contents of objects returned from the modules. Because you'll probably be doing this only once per type of data, there is really no need to write and execute a script at this time. If you want to preserve the work for future use, it's better to write a script in a text editor or use the IDLE editor that comes with Python and save the script. See Listing 1 on how to show the member functions in a class or module.

The infolist() command from the Python documentation returns a list the objects of a zipped archive. Run the dir() command on the first object from this list to get more information stored for each zipped file (Listing 2). This nice feature of looking at members in an object is called introspection.

An iteration on the list of objects displays relevant information about each file, such as when it was created, its extracted size, its compressed size and so on. It's better at this point to write a Python script and run it rather than work at the command line of an interactive Python shell. This way, you can preserve the script for later use. So, open up a text editor and type in the script.

The import statement allows you to use the sys module for getting a command-line argument of the file, and the zipfile module loads in the functionality for reading and unzipping files. As you saw from the Python shell, the infolist() method on the zipfile archive lists the files in it. So iterating over the items from the infolist() and then calling an orig_filename member function gives you a list of all files in the archive.

For more detailed information, try something like this:

print s.orig_filename, s.date_time, s.filename,
 ↪s.file_size, s.compress_size

You will receive more information about the file, quite similar to this:

mimetype (2006, 9, 9, 7, 50, 10) mimetype 39 39
Configurations2/statusbar/ (2006, 9, 9, 7, 50, 10)
Configurations2/statusbar/ 0 0
Configurations2/accelerator/current.xml
 ↪(2006, 9, 9, 7, 50, 10)
Configurations2/accelerator/current.xml 0 2
Configurations2/floater/ (2006, 9, 9, 7, 50, 10)
Configurations2/floater/ 0 0
...SNIPPED FOR BREVITY...

A typical ODF text file (with the .odt extension) will have some of the following files when unzipped. Here's the output:

mimetype
Configurations2/statusbar/
Configurations2/accelerator/current.xml
Configurations2/floater/
Configurations2/popupmenu/
Configurations2/progressbar/
Configurations2/menubar/
Configurations2/toolbar/
Configurations2/images/Bitmaps/
layout-cache
content.xml
styles.xml
meta.xml
Thumbnails/thumbnail.png
settings.xml
META-INF/manifest.xml

The most important file in the archive is the content.xml file, because it contains the data for the document itself. I discuss this file here; however, for detailed information on each tag and so on, take a look at the specification in the 2,000+-page PDF file from the OASIS Web site (see Resources).

Basically, the content.xml file looks like a DHTML file with tags for all the contents. The tag I was concerned with most for my search operation was the <text:p> tag and its closing tag </text:p>, which wraps paragraphs in a document. I'll show you how to get this tag from a content file later in this article.

Other files of interest in the archive are the META-INF/manifest.xml, mimetype, meta.xml and styles.xml. Other files simply contain data for configurations for the word processors reading and displaying the contents file.

The manifest is simply an XML file with a listing of all the files in the zipped archive. The mimetype file is a single line containing the mimetype of the content file. The meta.xml contains information about the author, creation date and so on. The styles file contains all the formatting styles for displaying the file.

You can extract any of these files from the ODF file with the read() method on the zip object to get it as a very long string. Once read, you can modify, view and write the whole string to disk as an independent file. Listing 3 shows how to extract the manifest.xml file.

For more than one file, you can use a list instead of the if clause:

if s.orig_filename in ['content.xml', 'styles.xml']:

This way, you can extract whatever files you need to look at simply by reading in their contents and either manipulating them or writing them off to disk.

The contents of an XML file are best suited for manipulation as a tree structure. Use the XML parsing capabilities in Python to get a tree of all the nodes within an XML file. Once you have the tree in a content file, you easily can get to the <text:p> nodes. You don't really have to extract the file to disk, because you also can run an XML parser on the string just as well as reading from a file.

There are two types of XML parsers, SAX and DOM. The SAX parser is faster but less memory-intensive, because it reads and parses an input file one tag at a time. You have only one tag at a time to work with and must track data yourself. In contrast, the DOM parser reads the entire file into memory and therefore provides better options for navigating and manipulating the XML nodes.

Let's look at examples of using both SAX and DOM, so you can see which one suits your purpose. The SAX example shows how to extract unique node names from an XML file. The DOM example illustrates how to read values from within specific nodes once the file has been read completely into memory.

First, let's use the SAX parser to see what nodes are available in the content.xml file. The code simply prints the name of each type node found in the file. Note that for different types of files you may get different node names (Listing 4).

A SAX program requires a class derived from ContentHandler and overriding functions to handle the start, middle and end of each node. The tagHandler class shown in Listing 4 is used primarily to track the start of each node and ignores the contents. Use the names found in the listing as keys in a dictionary. Then, use the keys() method to list the names into a list that you also can sort(). I use this technique quite often to get a sorting of unique members quickly, because the hashing functions in Python dictionaries are very fast. Here's the output from the program:

office:automatic-styles
office:body
office:document-content
office:font-face-decls
office:forms
office:scripts
office:text
style:font-face
style:list-level-properties
style:paragraph-properties
style:style
style:text-properties
text:a
text:line-break
text:list
text:list-item
text:list-level-style-bullet
text:list-style
text:p
text:s
text:sequence-decl
text:sequence-decls
text:span

I sorted the list of keys and printed only the types of tags found. You will have many tags, and the order of the listed tags is not the way they are found in the XML file. The tag you most likely will be concerned with is <text:p>, which has the text in each paragraph. In this example, I want to search for keywords in the text for one or more paragraphs in a document.

Although I can use the SAX version of the program to search for the text, I use the DOM libraries, because the code will be a little easier to write (and subsequently, easier to maintain), and I promised an example of this earlier.

The xml.dom and xml.dom.minidom packages in Python allow for reading in XML files as DOM trees. Both packages come with a standard Python installation. Use the minidom package as it has a smaller footprint and is easier to use than the dom package. The minidom package is sufficient for almost all my XML work, and I have never really needed the heavyweight xml.dom package. See Resources for more information.

Use the minidom packages to read the elements of an XML file into a tree-like structure. The nodes of the tree are objects based on the Node class in Python. The output from Listing 4 provides the names of nodes.

Up to this point, you have been using simple programs to list and extract data from the archive. Now, the next example runs a search operation on the file you've just read in. Look at the program in Listing 5.

The program is designed to work as a class that reads and searches for text in an ODF file. Declaring a class for the ODF reader helps in organizing the code for searching text within a node. The showManifest() member function simply tells me what files exist in the ODF file. In this particular program, I collect all the text as a list of paragraphs, and then I search for the keywords passed in from the command line. If the searched word matches, the paragraph is printed out.

The text found in each <text:p> is Unicode text. You have to convert this to normal text in order to print correctly and/or use in a widget. The encode() command translates the Unicode to a printable string.

Unicode provides a unique number for every character, regardless of the platform, program and language being used. The ability to work seamlessly with the same text across multiple platforms is a great feature for Unicode-enabled applications. Such features do come with a price for some legacy operations. Each Unicode character can have flags as bits set for special printing and so on, which causes a normal print statement to interpret each character as a number instead of text. In Python, the encode() member function of a Unicode string returns a printable version of the string. Here is an example code snippet for that:

def findIt(self,name):
    for s in self.text_in_paras:
        if name in s:
            print s.encode('utf-8')

The code in Listing 5 is not limited to an ODT file. You can modify the code presented here to work with spreadsheet files with an .ods file. Run the program in Listing 3 to get the contents.xml file out, and then run the second program (shown in Listing 4) to list the types of nodes. Below is a sample .ods file; note that this file also has paragraphs in addition to the table tags:

office:automatic-styles
office:body
office:document-content
office:font-face-decls
office:scripts
office:spreadsheet
style:font-face
style:style
style:table-column-properties
style:table-properties
style:table-row-properties
table:table
table:table-cell
table:table-column
table:table-row
text:p

Use the program in Listing 5 to extract and search text from paragraphs as before. A simple modification of changing the text:p to table:table-cell searches for text within cells instead of paragraphs.

To summarize, an ODF file is a zipped archive of several XML files. One of these files contains contents in known tags. Each type of ODF file can have different tags based on stored information. By using introspection and the XML parsing capabilities in Python, you can list the types of nodes in a file and read them into a tree structure. Once read, you can extract data only from those nodes in the tree that are of interest to you.

Kamran Husain has been working with software for 20 years. He can be contacted at khusain62@yahoo.com.