At the Forge: Easier Python paths with pathlib

Working with files is one of the most common things developers do. After all, you often want to read from files (to read information saved by other users, sessions or programs) or write to files (to record data for other users, sessions or programs).

Of course, files are located inside directories. Navigating through directories, finding files in those directories, and even extracting information about directories (and the files within them) might be common, but they're often frustrating to deal with. In Python, a number of different modules and objects provide such functionality, including os.path, os.stat and glob.

This isn't necessarily bad; the fact is that Python developers have used this combination of modules, methods and files for quite some time. But if you ever felt like it was a bit clunky or old-fashioned, you're not alone.

Indeed, it turns out that for several years already, Python's standard library has come with the pathlib module, which makes it easier to work with directories and files. I say "it turns out", because although I might be a long-time developer and instructor, I discovered pathlib only in the past few months, and I must admit, I'm completely smitten.

pathlib has been described as an object-oriented way of dealing with paths, and this description seems quite apt to me. Rather than working with strings, you work with "Path" objects, which not only lets you use all of your favorite path- and file-related functionality as methods, but also lets you paper over the differences between operating systems.

So in this article, I take a look at pathlib, comparing the ways you might have done things before to how pathlib allows you to do them now.

pathlib Basics

If you want to work with pathlib, you'll need to load it into your Python session. You should start with:
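A minimal form of that import, which keeps every name behind the pathlib prefix, might look like this:

```python
import pathlib

# Every name in the module is then reached through the prefix,
# for example pathlib.Path.
print(pathlib.Path)
```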

Note that if you plan to use certain names from within pathlib on a regular basis, you'll probably want to use from-import. However, I strongly recommend against saying from pathlib import *, which will indeed have the benefit of importing all of the module's names into the current namespace, but it'll also have the negative effect of importing all of the module's names into the current namespace. In short, import only what you need.

Now that you've done that, you can create a new Path object. This allows you to represent a file or directory. You can create it with a string, just as you would write a path (or filename) in more traditional Python code:
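As a small sketch, assuming the current directory ('.') is the path of interest:

```python
import pathlib

# '.' is just an illustrative choice; any path string works here.
p = pathlib.Path('.')

# The concrete class depends on the platform:
# PosixPath on UNIX-like systems, WindowsPath on Windows.
print(type(p).__name__)
```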

But wait a second. Do you use pathlib.Path to represent files or directories? The answer is "yes". You actually can use it for both. If you're not sure what kind of object you have, you always can ask it, with the is_dir and is_file methods:
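For instance, asking both questions of the current directory (again, '.' is only an illustrative choice) could look like this:

```python
import pathlib

p = pathlib.Path('.')     # the current directory

print(p.is_dir())         # True: '.' is a directory
print(p.is_file())        # False: '.' is not a regular file
```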

Notice that just because you create a Path object doesn't mean that the file or directory actually exists. You can check that with the exists method:
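A quick sketch of that check follows. The second filename is hypothetical and chosen on the assumption that no such file is present:

```python
import pathlib

here = pathlib.Path('.')
# Hypothetical name, assumed not to exist in the current directory.
missing = pathlib.Path('hypothetical-file-that-should-not-exist.txt')

print(here.exists())      # True: the current directory exists
print(missing.exists())   # False, provided the assumption above holds
```

Creating the Path object for the missing file succeeds without complaint; only exists consults the filesystem.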

Manipulating Paths

Let's say you want to work with a file called abc.txt in the directory /foo/bar. In a typical Python program, you then would say:
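A naive version of that code, using plain strings and the + operator, might be:

```python
dirname = '/foo/bar'
filename = 'abc.txt'

# Plain string concatenation: nothing is inserted between the two pieces.
fullpath = dirname + filename
print(fullpath)           # /foo/barabc.txt
```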

You aren't doing anything particularly exciting here; you're just joining two strings together, the first of which represents a directory and the second of which represents a file. But as you can see, there's already a problem, in that you don't have a / separating the directory from the filename.

Using os.path.join not only ensures that there are slashes where you need them, but it also works cross-platform, using \ if your program is running on a Windows system.
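A sketch of that call, using the directory and filename named above:

```python
import os.path

dirname = '/foo/bar'
filename = 'abc.txt'

# os.path.join inserts the platform's separator between the pieces.
fullpath = os.path.join(dirname, filename)
print(fullpath)           # /foo/bar/abc.txt on UNIX-like systems
```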

That's nice, but pathlib offers another option: you can use the / operator, normally used for division, to join paths together. For example:
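Here is one way that might look, with dirname as a Path object and the filename as an ordinary string:

```python
import pathlib

dirname = pathlib.Path('/foo/bar')

# Path / str produces a new Path with the pieces joined.
fullpath = dirname / 'abc.txt'

print(fullpath)
print(fullpath.name)      # abc.txt
```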

It takes a bit of time to get used to seeing / between what you might think of as strings. But remember that dirname isn't a string; rather, it's a Path object. And / is a Python operator, which means that it can be overloaded and redefined for different types.

If you forget and try to treat your Path object as a string, Python will remind you:
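As an illustration, trying to join with + instead of / raises a TypeError. The try/except here exists only so that the sketch runs to completion:

```python
import pathlib

dirname = pathlib.Path('/foo/bar')
error = None

try:
    # + is not defined between a Path and a str.
    fullpath = dirname + 'abc.txt'
except TypeError as e:
    error = e
    print('TypeError:', e)
```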

Working with Directories

If your Path object represents a directory, you can run a number of directory-related methods on it. You can call these methods on non-directory Path objects as well, but the results won't be useful; iterdir, for example, raises an exception when the path isn't a directory.

For example, let's say you want to find all of the files in the current directory. You can say:
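A short sketch using iterdir on the current directory:

```python
import pathlib

p = pathlib.Path('.')

entries = p.iterdir()
print(entries)            # an iterator object, not a list

# Each item is a Path object for one entry in the directory.
for one_entry in entries:
    print(one_entry)
```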

Notice that the result from calling p.iterdir() is not a list but an iterator (typically a generator object). You can put such an object in a for loop or any other context that expects iteration. It will return one Path object for each entry in your directory, whether that entry is a file or a subdirectory.

But, what if you're not interested in getting all of the filenames? What if you want to get only those files ending with .py? If you were working in the UNIX shell, you'd say something like ls *.py. Such a pattern isn't a regular expression, despite what many people believe. Rather, such a pattern is known as "globbing". The glob module in Python handles that for you, letting you say something like:
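For example, with a pattern matched against the current directory:

```python
import glob

# The pattern is interpreted relative to the current directory.
py_files = glob.glob('*.py')
print(py_files)
```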

The result of invoking glob.glob is a list of strings, with each string containing a filename that matches the pattern.

Path objects have similar functionality, thanks to the glob method. Like iterdir, the glob method returns an iterator rather than a list, meaning that you can use it in a for loop. For example:
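A sketch of the same search using the glob method, again on the current directory:

```python
import pathlib

p = pathlib.Path('.')

# Each match comes back as a Path object rather than a string.
for one_item in p.glob('*.py'):
    print(one_item)
```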

The good news is that you get back the filenames in the directory. And the filenames already have been filtered by glob, so you're getting only matches. The even better news is that you get back Path objects (in this case, PosixPath objects, since this example is on a UNIX system), which means that you can use all the tricks you've enjoyed so far.

Working with Files

Once you have a file, what can you do with it? Well, one obvious candidate is to open it and read its contents. You can do that with the read_bytes and read_text methods, which return "bytes" and string objects, respectively.

Note that unlike the read method that you typically can run on a "file" object in Python, both read_text and read_bytes open the file, retrieve its contents and close it again. Thus, you don't have to worry about where the internal file pointer is located or whether you'll be reading from the start of the file or elsewhere.
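As a rough sketch, assuming a small sample file created only for this illustration (abc.txt in a temporary directory), the two methods could be used like this:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, written here so the example is self-contained.
    p = pathlib.Path(tmp) / 'abc.txt'
    p.write_text('hello\nworld\n', encoding='utf-8')

    # Each call opens the file, reads everything and closes it again.
    text = p.read_text(encoding='utf-8')
    data = p.read_bytes()

print(type(text).__name__)    # str
print(type(data).__name__)    # bytes
```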

However, those methods can cause problems if you read from a particularly large file. Python happily will read as much as it can into a huge string, potentially using all (or most) of the memory on your computer.

A better strategy, and a traditional one in Python, is to read through the file's contents one line at a time. This is accomplished by putting an open "file" object into a for loop; file objects are iterable and return one line (that is, up to and including the following newline) in each iteration.

Note that although you certainly can use the built-in open function, you also can take advantage of the open method for Path objects:
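A minimal sketch, assuming a two-line sample file created in a temporary directory purely for illustration:

```python
import pathlib
import tempfile

line_count = 0

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file for the demonstration.
    p = pathlib.Path(tmp) / 'abc.txt'
    p.write_text('first line\nsecond line\n')

    # In Python 3.6 and later, the built-in open(p) accepts a Path too.
    with p.open() as f:
        for one_line in f:
            print(one_line)
            line_count += 1
```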

This will print all of the lines in the file. Notice that open knows how to work with a Path object just as easily as a string. However, you'll also notice that when you print the file, the lines are double-spaced. That's because each iteration includes the newline character, and print also inserts a newline character after each line it prints. You can adjust this by passing an empty string to the end parameter in the print function:
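The adjusted loop might look like this, again with a hypothetical sample file created only for the demonstration:

```python
import pathlib
import tempfile

line_count = 0

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file for the demonstration.
    p = pathlib.Path(tmp) / 'abc.txt'
    p.write_text('first line\nsecond line\n')

    with p.open() as f:
        for one_line in f:
            # end='' stops print from adding a second newline.
            print(one_line, end='')
            line_count += 1
```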

Aside from opening files, you also can invoke a number of other methods on a Path object. For example, I mentioned before that you might not want to read the entirety of a large file into memory. You can check the file's size, as well as many other attributes, using the stat method. This method, like the traditional os.stat Python function, returns an object describing the file; its st_size attribute gives the file's size in bytes:
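For illustration, here is a sketch that writes a five-byte file (a hypothetical abc.txt in a temporary directory) and then asks stat about it:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file containing exactly five bytes.
    p = pathlib.Path(tmp) / 'abc.txt'
    p.write_bytes(b'hello')

    info = p.stat()
    size = info.st_size       # size in bytes
    mtime = info.st_mtime     # last modification time, seconds since the epoch

print(size)                   # 5
print(mtime)
```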

You similarly can retrieve other items that stat reports, including the file's most recent modification timestamp, and IDs of the user and group that own the file.

If you want to examine or manipulate the filename, you can do so with attributes such as suffix:
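A brief sketch using the /foo/bar/abc.txt path from earlier; name, stem and with_suffix are related tools shown here alongside suffix, and none of them touches the filesystem:

```python
import pathlib

p = pathlib.Path('/foo/bar/abc.txt')

print(p.suffix)                     # .txt
print(p.stem)                       # abc
print(p.name)                       # abc.txt

# with_suffix returns a new Path; the original is unchanged.
print(p.with_suffix('.py').name)    # abc.py
```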

Conclusion

If you work with files on a regular basis from within Python programs, I suggest you look at pathlib. It's not revolutionary, but it does help to bring a lot of file-manipulating code under one roof. Moreover, the / syntax, although odd-looking at the start, emphasizes the fact that you're dealing with Path objects, rather than strings. And besides, it's just convenient to have access to so much functionality without having to remember where it's located.

Resources

pathlib was first proposed (and accepted) in PEP 428, which is worth reading. It has been around since Python 3.4. If you're still using Python 2.7, a package is available on PyPI with a backport, known as pathlib2.

About the Author

Reuven Lerner teaches Python, data science and Git to companies around the world. You can subscribe to his free, weekly "better developers" e-mail list, and learn from his books and courses at http://lerner.co.il. Reuven lives with his wife and children in Modi'in, Israel.