Hack and /

Read Linux Journal from the Command Line

Kyle Rankin

Issue #212, December 2011

Kindles? Nooks? It just takes your trusty command line and a few command-line tools to read Linux Journal.

In this day and age, there are more ways to read than ever before. Even though Linux Journal no longer publishes on paper, you still can read it with Web browsers, PDF software, e-book readers and cell phones. I don't have an e-book reader myself, but I think you could make the argument that the one true way to read Linux Journal is from the command line. After all, I read my e-mail, chat, check Twitter, do most of my day job and write my articles from the command line (okay, it's true I use gvim too; it frees up a terminal window), so why not read Linux Journal from the place where I spend most of my time?

The Text, the Whole Text and Nothing but the Text

The first format I'm going to cover is the Portable Document Format (PDF). Although PDFs are aimed at capturing a document so that it looks the same to everyone, it turns out you also can strip out the text and images from a PDF file. The first program I use for this is the aptly named pdftotext. This program is part of a group of PDF utilities that are packaged as the poppler-utils package under Debian-based systems, but you should be able to find them under a similar name for your distribution. The most basic way to execute pdftotext is the following:

$ pdftotext input_document.pdf output_document.txt

By default, pdftotext does not attempt to preserve all the formatting of the document, which is nice because you don't have to scroll up and down multiple columns of a page. The downside is that it doesn't know to strip out all the extraneous text, headers, pull-quotes and other text you will find in a magazine article, so the result is a bit cluttered, as you can see in Figure 1.
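
If a full issue is more text than you want to wade through, one way to cut it down is to convert only the pages that hold a single article. Here's a minimal sketch using pdftotext's -f (first page) and -l (last page) options. The lj_pages function name and the page numbers in the usage line are made up for illustration, and a - in place of the output file sends the text to the terminal instead of a file:

```shell
# lj_pages FIRST LAST FILE.pdf
# Hypothetical helper (not part of poppler-utils): print only pages
# FIRST through LAST of a PDF as plain text on standard output.
lj_pages() {
    pdftotext -f "$1" -l "$2" "$3" -
}

# Example with made-up page numbers; pipe through less to scroll:
#   lj_pages 30 33 input_document.pdf | less
```

This only trims the page range. The headers and pull-quotes on those pages still come along for the ride.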

Figure 1. pdftotext's Default Output for My Column

Text Plus Columns

So although I suppose pdftotext's default output is readable, it's less than ideal. That's not to say I'm out of tricks though. Among its command-line options, it provides a -layout argument that attempts to preserve the original text layout. It's still not perfect, as you can see in Figure 2, but if you size your terminal so that it can fit a full page, it is rather readable.
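
For reference, here is a sketch of what that might look like in practice. The lj_layout name is hypothetical and the file name is the placeholder from earlier. The function converts with -layout and pages through the result without leaving a text file behind:

```shell
# Hypothetical helper: convert a PDF with its layout preserved and page
# through the result. "-" tells pdftotext to write to standard output.
lj_layout() {
    pdftotext -layout "$1" - | ${PAGER:-less}
}

# Usage:
#   lj_layout input_document.pdf
```

If you would rather keep the text around, replace the - with an output file name, as in the basic example.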

Figure 2. pdftotext with the Layout Preserved

Text Plus Images

There is a bit of a problem, you'll find, if you do read Linux Journal in text-only mode: there are no pictures! Although some articles still are educational in pure text, with others, it really helps to see a diagram, screenshot or some other graphical representation of what the writer is saying. You aren't without options, but this next choice is a bit of a hack. Because there are versions of the w3m command-line Web browser that can display images in a terminal (the w3m-img package on a Debian-based system provides it), what you can do is convert the PDF to HTML and then view the HTML with w3m. To do this, you use the pdftohtml program that came with the same package that provided pdftotext. This program creates a lot of files, so I recommend creating a new directory for your issue and cd-ing to it before you run the command. Here's an example of the steps to convert the September 2011 issue:

$ mkdir lj-2011-09
$ cd lj-2011-09
$ pdftohtml -noframes /path/to/linuxjournal201109-dl.pdf lj-2011-09.html

Once the command completes, you can run the w3m command against the lj-2011-09.html file, and if you have the special version that loads images, you will start to see the images load in the terminal. Now, by default, this output is much like the original output of pdftotext. There is no attempt to preserve formatting, so the output can be a bit of a mess to read. Also, as you can see in Figure 3, my headshot looks like a photo negative.
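
Put together, the conversion and viewing steps might look like the sketch below. The lj_html function and its argument order are hypothetical. It runs in a subshell so the cd doesn't follow you back to your prompt, which also means the PDF path needs to be absolute:

```shell
# lj_html /absolute/path/to/issue.pdf NAME
# Hypothetical wrapper: convert the PDF inside a fresh directory called
# NAME (a plain directory name, not a path), then open the resulting
# NAME.html in w3m.
lj_html() (
    mkdir -p "$2" && cd "$2" || exit 1
    pdftohtml -noframes "$1" "$2.html" && w3m "$2.html"
)

# Usage:
#   lj_html /path/to/linuxjournal201109-dl.pdf lj-2011-09
```

Whether the images actually show up still depends on having an image-capable w3m installed and a terminal that supports it.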

Figure 3. A More Negative Version of Me

Text Plus Images Plus Columns

Although it's nice to see the images in a terminal, it would be better if everything were arranged so it made a bit more sense. As with pdftotext, pdftohtml has an option that attempts to preserve the layout. In the case of pdftohtml, you add the -c option:

$ mkdir lj-2011-09
$ cd lj-2011-09
$ pdftohtml -noframes -c /path/to/linuxjournal201109-dl.pdf lj-2011-09.html

On the one hand, this command generates some really good-looking graphical pages. On the other hand, it looks like the images are displayed over the top of the text, and as you can see in Figure 4, there's a whole graphical section of my column with no text on it. As you scroll down the page, you still can read a good deal of the text, but it stands independent of the image. On the plus side, it no longer shows a negative headshot.

Figure 4. It's an improvement for image quality, but worse for readability.

Go with the Reflow

So PDF conversion technically worked, but there definitely was room for improvement. As I thought about it, I realized that epub files work really well when it comes to reflowing text on a small screen. I figured that might be a better source file for my command-line output.

The tool I found best-suited to the job of converting epub files to text is Calibre. In my case, I just had to install a package of the same name, and I was provided with a suite of epub tools, including ebook-convert. As with pdftotext, all you need to do is specify the input file and output file, and ebook-convert generates the output file in the format you want based on the file extension (.txt in this case). To create a basic text file, I would just type:

$ ebook-convert /path/to/LJ209-sept.epub LJ209-sept.txt

I found the resulting text file quite readable actually, although it did like to indent all of the headers and a lot of the rest of the text, so that they started at the center of the terminal. That said, I would say that so far, it was the most readable output of the bunch, as you can see in Figure 5.

Figure 5. Even with Indents, a Quite Readable LJ Article

So, with all of those different ways to read Linux Journal from the command line, two methods stand out to me right now. If you don't need images, I think the epub-to-text conversion works the best, with the pdftotext that preserves layout coming in second. If you do need to see images though, it seems like your main choices are either to convert from PDF to HTML and then use w3m, or just to use w3m to browse the Linux Journal archives directly.
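
To make the text-only ranking concrete, here's a rough sketch of a helper that picks the converter based on the file extension: ebook-convert for epub files and pdftotext -layout for PDFs. The lj_to_text name and the choice to write the .txt file next to the source are assumptions for illustration only:

```shell
# lj_to_text FILE
# Hypothetical helper: convert an epub or PDF issue to FILE.txt using
# the two text-only methods described in this column.
lj_to_text() {
    src=$1
    out=${src%.*}.txt    # same name, with the extension swapped for .txt
    case $src in
        *.epub) ebook-convert "$src" "$out" ;;
        *.pdf)  pdftotext -layout "$src" "$out" ;;
        *)      echo "unsupported file type: $src" >&2; return 1 ;;
    esac
}

# Usage:
#   lj_to_text /path/to/LJ209-sept.epub && less /path/to/LJ209-sept.txt
```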

Kyle Rankin is a Sr. Systems Administrator in the San Francisco Bay Area and the author of a number of books, including The Official Ubuntu Server Book, Knoppix Hacks and Ubuntu Hacks. He is currently the president of the North Bay Linux Users' Group.