Working with files is one of the most common things developers do. After all, you often want to read from files (to read information saved by other users, sessions or programs) or write to files (to record data for other users, sessions or programs).
Of course, files are located inside directories. Navigating through directories, finding files in those directories, and even extracting information about directories (and the files within them) might be common, but they're often frustrating to deal with. In Python, a number of different modules and objects provide such functionality, including os.path, os.stat and glob.
This isn't necessarily bad; the fact is that Python developers have used this combination of modules, methods and files for quite some time. But if you ever felt like it was a bit clunky or old-fashioned, you're not alone.
Indeed, it turns out that for several years already, Python's standard library has come with the pathlib module, which makes it easier to work with directories and files. I say "it turns out", because although I might be a long-time developer and instructor, I discovered "pathlib" only in the past few months—and I must admit, I'm completely smitten.
pathlib has been described as an object-oriented way of dealing with paths, and this description seems quite apt to me. Rather than working with strings, instead you work with "Path" objects, which not only allows you to use all of your favorite path- and file-related functionality as methods, but it also allows you to paper over the differences between operating systems.
So in this article, I take a look at pathlib, comparing the ways you might have done things before to how pathlib allows you to do them now.
If you want to work with pathlib, you'll need to load it into your Python session. You should start with:
import pathlib
Note that if you plan to use certain names from within pathlib on a
regular basis, you'll probably want to use from-import
. However, I
strongly recommend against saying from pathlib import *
, which
will indeed have the benefit of importing all of the module's names
into the current namespace, but it'll also have the negative effect
of importing all of the module's names into the current namespace. In
short, import only what you need.
Now that you've done that, you can create a new Path object. This allows you to represent a file or directory. You can create it with a string, just as you might do a path (or filename) in more traditional Python code:
p2 = pathlib.Path('.')
But wait a second. Do you use pathlib.Path
to represent files or
directories? The answer is "yes". You actually can use it for both.
If you're not sure what kind of object you have, you always can ask
it, with the is_dir
and is_file
methods:
>>> p1 = pathlib.Path('hello.py')
>>> p2 = pathlib.Path('.')
>>> p1.is_file()
True
>>> p2.is_file()
False
>>> p1.is_dir()
False
>>> p2.is_dir()
True
Notice that just because you create a Path object doesn't mean that the
file or directory actually exists. You can check that with the
exists
method:
>>> p1 = pathlib.Path('hello.py')
>>> p1.exists()
True
>>> p2 = pathlib.Path('asdfafafsafaa')
>>> p2.exists()
False
Let's say you want to work with a file called abc.txt in the directory /foo/bar. In a typical Python program, you then would say:
open('/foo/bar' + 'abc.txt')
You aren't doing anything particularly exciting here; you're just joining two strings together, the first of which represents a directory and the second of which represents a file. But as you can see, there's already a problem, in that you don't have a / separating the directory from the filename.
You can avoid such problems by using os.path.join
:
>>> import os.path
>>> dirname = '/foo/bar'
>>> filename = 'abc.txt'
>>> os.path.join(dirname, filename)
'/foo/bar/abc.txt'
Using os.path.join
not only ensures that there are slashes
where you
need them, but it also works cross-platform, using \ if your program
is running on a Windows system.
That's nice, but pathlib offers another option: you can use the / operator, normally used for division, to join paths together. For example:
>>> dirname = pathlib.Path('/foo/bar')
>>> dirname / filename
PosixPath('/foo/bar/abc.txt')
It takes a bit of time to get used to seeing / between what you might
think of as strings. But remember that dirname
isn't a string;
rather, it's a Path object. And / is a Python operator, which means
that it can be overloaded and redefined for different types.
If you forget and try to treat your Path object as a string, Python will remind you:
>>> dirname + filename
TypeError: unsupported operand type(s) for +: 'PosixPath'
↪and 'str'
If your Path object contains a directory, there are a bunch of directory-related methods that you can run on it. Actually, you can run these methods on non-directory Path objects as well, but it won't end very usefully or well.
For example, let's say you want to find all of the files in the current directory. You can say:
>>> p = pathlib.Path('.')
>>>
>>> p.iterdir()
<generator object Path.iterdir at 0x111e4b1b0>
Notice that the result from calling p.iterdir()
is a generator
object. You can put such an object in a for
loop or other context
that expects/requires iteration. The generator will return one value
for each filename in your directory.
But, what if you're not interested in getting all of the filenames? What
if you want to get only those files ending with .py? If you were
working in the UNIX shell, you'd say something like ls *.py
.
Such a pattern isn't a regular expression, despite what many people
believe. Rather, such a pattern is known as "globbing". The
glob
module in Python handles that for you, letting you say something like:
import glob
glob.glob('*.py')
The result of invoking glob.glob
is a list of strings, with each
string containing a filename that matches the pattern.
Path objects have similar functionality, thanks to the glob
method. Like iterdir
, the glob
method returns a generator, meaning
that you can use it in a for
loop. For example:
>>> p.glob('*.py')
<generator object Path.glob at 0x111b38480>
>>> for one_item in p.glob('*.py'):
print(f"{one_item}: {type(one_item)}")
hello.py: <class 'pathlib.PosixPath'>
reverse_lines.py: <class 'pathlib.PosixPath'>
old_test_hello.py: <class 'pathlib.PosixPath'>
The good news is that you get back the filenames in the directory. And
the filenames already have been filtered by glob
, so you're
getting only matches. The even better news is that you get back Path
objects (in this case, PosixPath
objects, since this example
isn't on a UNIX
system), which means that you can use all the tricks you've enjoyed
so far.
Once you have a file, what can you do with it? Well, one obvious
candidate is to open it and read its contents. You can do that with
the read_bytes
and read_text
methods, which
return "bytes" and
string objects, respectively.
Note that unlike the read
method that you typically can run on a
"file" object in Python, both read_text
and
read_bytes
open the
file, retrieve its contents and close it again. Thus, you don't have
to worry about where the internal file pointer is located or whether
you'll be reading from the start of the file or elsewhere.
However, those methods can cause problems if you read from a particularly large file. Python happily will read as much as it can into a huge string, potentially using all (or most) of the memory on your computer.
A better strategy, and a traditional one in Python, is to read through
the file's contents one line at a time. This is accomplished by
putting an open "file" object into a for
loop; file objects are
iterable and return one line (that is, up to and including the following
newline) in each iteration.
Note that although you certainly can use the built-in open
function, you
also can take advantage of the open
method for
Path
objects:
>>> p = pathlib.Path('hello.py')
>>> for one_line in p.open():
>>> print(one_line)
This will print all of the lines in the file. Notice that open
knows
how to work with a Path object just as easily as a string. However,
you'll also notice that when you print the file, the lines are
double-spaced. That's because each iteration includes the newline
character, and print
also inserts a newline character after each
line it prints. You can adjust this by passing an empty string to the
end
parameter in the print
function:
>>> for one_line in p.open():
>>> print(one_line, end='')
Aside from opening files, you also can invoke a number of other methods
on a Path object. For example, I mentioned before that you might not
want to read the entirety of a large file into memory. You can check
the file's size, as well as many other attributes, using the
stat
method. This method, like the traditional os.stat
Python function,
returns a file's size in bytes:
>>> p.stat().st_size
123
You similarly can retrieve other items that stat
reports, including
the file's most recent modification timestamp, and IDs of the user and
group that own the file.
If you want to manipulate the filename, you can do so with
methods, such as suffix
:
>>> p.suffix()
'.py'
If you work with files on a regular basis from within Python programs, I suggest you look at pathlib. It's not revolutionary, but it does help to bring a lot of file-manipulating code under one roof. Moreover, the / syntax, although odd-looking at the start, emphasizes the fact that you're dealing with Path objects, rather than strings. And besides, it's just convenient to have access to so much functionality without having to remember where it's located.
pathlib was first proposed (and accepted) in PEP 428, which is worth reading here. It has been around since Python 3.4. If you're still using Python 2.7, a package is available on PyPI with a backport, known as pathlib2.