At the Forge: Introducing Python 3.7's Dataclasses

Python 3.7's dataclasses reduce repetition in your class definitions. By Reuven M. Lerner

Newcomers to Python often are surprised by how little code is required to accomplish quite a bit. Between powerful built-in data structures that can do much of what you need, comprehensions to take care of many tasks involving iterables, and the lack of getter and setter methods in class definitions, it's no wonder that Python programs tend to be shorter than those in static, compiled languages.

However, this amazement often ends when people start to define classes in Python. True, the class definitions generally will be pretty short. But the __init__ method, which adds attributes to a new object, tends to be rather verbose and repetitive—for example:


class Book(object):
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price

Let's ignore the need for the use of self, which is an outgrowth of the LEGB (local, enclosing, global, builtins) scoping rules in Python and which isn't going away. Let's also note that there is a world of difference between the parameters title, author and price and the attributes self.title, self.author and self.price.

What newcomers often wonder—and in the classes I teach, they often wonder about this out loud—is why you need to make these assignments at all. After all, can't __init__ figure out that the three non-self parameters are meant to be assigned to self as attributes? If Python's so smart, why doesn't it do this for you?

I've given several answers to this question through the years. One is that Python tries to make everything explicit, so you can see what's happening. Having automatic, behind-the-scenes assignment to attributes would violate that principle.

At a certain point, I actually came up with a half-baked solution to this problem, although I did specifically say that it was un-Pythonic and thus not a good candidate for a more serious implementation. In a blog post, "Making Python's __init__ method magical", I proposed that you could assign parameters to attributes automatically, using a combination of inheritance and introspection. This was a thought experiment, not a real proposal. And yet, despite my misgivings and the skeletal implementation, there was something attractive about not having to write the same boilerplate __init__ method, with the same assignment of arguments to attributes.

Fast-forward to 2018. As I write this, Python 3.7 is about to be released. And it turns out that one of the highlights of this new version is "dataclasses", a way to write classes that removes the need to write boilerplate code. The implementation was done in a much different (and better) way than I had proposed, and it includes a great deal of functionality I hadn't even imagined. I expect that for many people, dataclasses will become their preferred way to create Python classes.

So in this article, I review the new dataclasses functionality in Python 3.7. If you're reading this before 3.7 has been released, I suggest downloading and installing it, albeit not as your main, production version of Python, just in case issues arise before the first production release.

Simple Dataclasses

Let's take the class from above:


class Book(object):
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price

Here's how you can translate it into a dataclass:


from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    price: float

If you have any experience with Python, you can recognize the outline of what's going on here, but a whole bunch of things are different.

First is the use of the dataclass decorator to modify the class definition. Decorators are one of Python's most powerful tools, allowing you to modify functions and classes both when they are defined and when they are called. In this case, the decorator inspects the class definition and then writes __init__ and other methods on the fly, based on that definition.

Next, you'll notice that no __init__ has been defined, or any other methods, for that matter. Instead, what is defined is what would appear to be class attributes. But then again, they're not really class attributes, since they lack any values. So what are they doing?

Moreover, there might not be any values associated with these class attributes, but there are types, using the variable-annotation syntax introduced in Python 3.6. Type annotations allow you to tag a variable with a particular type. Python doesn't enforce the annotations at runtime, but they can be used by your editor or by external programs (such as MyPy) to improve the accuracy of your code. You don't have to stick with the simple built-in types either; you can use the typing module to import a variety of predefined types, including one called Any if you want to allow for anything.
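To see that the annotations are recorded but not checked when the program runs, here is a minimal sketch using the Book dataclass from above. The deliberately wrong price value is my own illustration:

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    price: float

# The annotations are stored on the class itself
print(Book.__annotations__)

# Nothing checks them at runtime, so a string is accepted as a price.
# An external checker such as MyPy would flag this call.
b = Book('MyTitle1', 'AuthorFirst AuthorLast', 'not a number')
print(type(b.price))
```

In other words, the types document your intent and feed external tools, but the generated __init__ assigns whatever it is given.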

So already you likely can see a few advantages to dataclasses. You don't need to write the boilerplate code in __init__, and type annotations already are included. But aside from clearer, shorter code and the ability to run code checkers, what else do you get?

Well, it turns out that the @dataclass decorator doesn't just create __init__. It creates a number of other methods as well. For example, it defines __eq__, the method that lets you determine if two instances are equal to one another using the == equality operator. It also defines __repr__ to be far more attractive and useful than the existing Python default.

With the above class definition, you thus can say:


b1 = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)
b2 = Book('MyTitle2', 'AuthorFirst AuthorLast', 25)

print(b1)
print(b2)

The output will be:


Book(title='MyTitle1', author='AuthorFirst AuthorLast', price=20)
Book(title='MyTitle2', author='AuthorFirst AuthorLast', price=25)
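The generated __eq__ can be tried out in the same way. This is a small sketch, assuming the same Book dataclass; b3 is an extra instance I've added for the comparison:

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    price: float

b1 = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)
b2 = Book('MyTitle2', 'AuthorFirst AuthorLast', 25)
b3 = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)

print(b1 == b2)   # False: the titles and prices differ
print(b1 == b3)   # True: every field matches
print(b1 is b3)   # False: they are still two distinct objects
```

Without the decorator, b1 == b3 would be False, because the default comparison looks at object identity rather than at the attributes.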

Note that while the attribute names are specified in the dataclass at the class level, the names actually are stored as attributes on the individual instances. You can see this by exploring the new objects a little bit. For example, if you ask to print vars(b1), you get the following:


{'title': 'MyTitle1', 'author': 'AuthorFirst AuthorLast', 'price': 20}

And if you ask to see the type of b1.title, Python tells you that it's a string. So nothing fancy is being created here, such as a property or a descriptor. Rather, this is just creating a regular old class, albeit with some useful and interesting functionality.

Adding Methods

Part of the thinking behind the development of dataclasses was that folks wanted something easier to write than regular Python classes, but with the same easy-to-read syntax as named tuples or dictionaries. The name "dataclass" thus seems to imply that such classes are to be used only for storing data, without the ability to write methods.

But that's not the case. You can add methods to a dataclass, just as you would add them to any other class. For example, say you want to get the book author's name as a list of strings, rather than as a single string. This would be useful if you want to alphabetize or display books by the author's last name and then first name.

In a dataclass, you add such a method by...adding the method. In the body of the class, you would write:


def author_split(self):
    return self.author.split()

In other words, you can create whatever methods you want, using the same syntax that you've used before.
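Put together with the fields, the whole class might look like the following sketch. The sample values are the same placeholders used earlier:

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    price: float

    # An ordinary method, defined alongside the annotated fields
    def author_split(self):
        return self.author.split()

b = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)
print(b.author_split())   # ['AuthorFirst', 'AuthorLast']
```

The decorator only looks at the annotated names when it builds __init__, so the method is left alone.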

Optional Functionality

Dataclasses offer a great deal of functionality that can help you modify the default behavior.

First and foremost, you can provide each of your declared attributes with a default value. Doing so makes them optional when you create a new instance. For example, say you want the default book price to be $20. You can say:


@dataclass
class Book:
    title: str
    author: str
    price: float = 20

Notice how the syntax reflects the Python 3 syntax for function parameters that have both type annotation and a default value. Just as is the case with function parameter defaults, dataclass attributes with defaults must come after those without defaults.
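Here is a short sketch of both points: the default makes price optional, and putting a field without a default after one with a default is rejected when the class is defined. BadBook is a hypothetical class that exists only to trigger the error, and the exact wording of the error message may vary between Python versions:

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    price: float = 20

b1 = Book('MyTitle1', 'AuthorFirst AuthorLast')       # price omitted
b2 = Book('MyTitle2', 'AuthorFirst AuthorLast', 35)   # price given
print(b1.price)   # 20
print(b2.price)   # 35

# Hypothetical class: "author" has no default but follows "price", which does
try:
    @dataclass
    class BadBook:
        title: str
        price: float = 20
        author: str
except TypeError as e:
    print('TypeError:', e)
```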

Rather than declaring a value for a default, you actually can pass a function that is executed (without any arguments) each time a new object is created.

To do this, and to take advantage of a number of other features having to do with dataclass attributes, you must use the field function (from the dataclasses module), which lets you tailor the way the attribute is defined and used.

If you pass a function to the default_factory parameter, that function will be invoked each time a new instance is created without a specified value for that attribute. This is very similar to the way that the defaultdict class works, except that it can be specified for each attribute.

For example, you can give each new book a default random price between $20 and $100 in the following way:


import random
from dataclasses import dataclass, field

def random_price():
    return random.randint(20, 100)

@dataclass
class Book:
    title: str
    author: str
    price: float = field(default_factory=random_price)

Note that you cannot both set default_factory and a default value; the whole point is that default_factory lets you run a function and, thus, provides the value dynamically, when the new instance is created.
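Another place where default_factory is handy is giving each instance its own empty list. The following is only a sketch; Shelf is a hypothetical class that I've made up for illustration and isn't part of the Book example:

```python
from dataclasses import dataclass, field

# Hypothetical class, used only to illustrate default_factory
@dataclass
class Shelf:
    name: str
    books: list = field(default_factory=list)

s1 = Shelf('fiction')
s2 = Shelf('reference')
s1.books.append('MyTitle1')
print(s1.books)   # ['MyTitle1']
print(s2.books)   # []  (list() was called separately for each instance)

# Asking for both a default and a default_factory is an error
try:
    field(default=20, default_factory=int)
except ValueError as e:
    print('ValueError:', e)
```

Because the factory runs once per new instance, s1 and s2 don't share a list, which is the usual worry with mutable defaults.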

The main thing that the __init__ method in a Python object does is add attributes to the new instance. Indeed, I'd argue that the majority of __init__ methods I've written through the years do little more than assign the parameters to instance attributes. For such objects, the default behavior of dataclasses works just fine.

But in some cases, you'll want to do more than just assign values. Perhaps you want to set up values that aren't dependent on parameters. Perhaps you want to take the parameters and adjust them in some way. Or perhaps you want to do something bigger, such as open a file or make a network connection.

Of course, the whole point of a dataclass is that it takes care of writing __init__ for you. You could define __init__ yourself, but then you would give up that benefit. So if you want to do more than just assign the parameters to attributes, you need to do it somewhere other than __init__.

For cases like this, dataclasses have another method at their disposal, called __post_init__. If you define __post_init__, it will run after the dataclass-defined __init__. So, you're assured that the attributes have been set, allowing you to adjust or add to them, as necessary.
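As a sketch of how that might look, here is a __post_init__ that adjusts and validates the values that the generated __init__ has just assigned. The specific clean-up rules are my own assumptions, chosen only for illustration:

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    price: float

    def __post_init__(self):
        # By now, the generated __init__ has set title, author and price
        self.title = self.title.strip()
        self.price = float(self.price)
        if self.price < 0:
            raise ValueError('price cannot be negative')

b = Book('  MyTitle1 ', 'AuthorFirst AuthorLast', 20)
print(b)   # Book(title='MyTitle1', author='AuthorFirst AuthorLast', price=20.0)
```

The same method would be the natural place to open a file or set up an attribute that doesn't depend on any parameter.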

Here's another case that dataclasses handle. Normally, instances of user-created classes are hashable. But by default, instances of a dataclass aren't. This means you can't use them as keys in dictionaries or as elements in sets.

You can get around this by declaring your class to be "frozen", making its instances immutable. In other words, an instance of a frozen dataclass is created and then never changes, similar to a named tuple. You can do this by giving a True value to the dataclass decorator's frozen parameter:


>>> @dataclass(frozen=True)
... class Foo:
...     x: int
...
>>> f1 = Foo(10)
>>> f1.x = 100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dataclasses.py", line 448, in _frozen_setattr
    raise FrozenInstanceError(f'cannot assign to field {name!r}')
dataclasses.FrozenInstanceError: cannot assign to field 'x'

Moreover, now you can run hash on the variable:


>>> hash(f1)
3430012387537
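The practical payoff is that frozen instances can serve as dictionary keys and set elements. Here is a brief sketch; the Mutable and Frozen class names are hypothetical, chosen just to contrast the two behaviors:

```python
from dataclasses import dataclass

@dataclass
class Mutable:          # default settings: not hashable
    x: int

@dataclass(frozen=True)
class Frozen:           # frozen: hashable
    x: int

try:
    hash(Mutable(10))
except TypeError as e:
    print('TypeError:', e)

# Equal frozen instances hash alike, so lookups by value work
names = {Frozen(10): 'ten', Frozen(20): 'twenty'}
print(names[Frozen(10)])               # ten
print(len({Frozen(10), Frozen(10)}))   # 1
```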

There are a number of other optional pieces of functionality in dataclasses as well, such as indicating how your objects will be compared and which fields will be printed. It's impressive to see just how much thought has gone into the creation of dataclasses. I wouldn't be surprised if in the next few years, most Python classes will be defined as dataclasses, along with whatever customization and additions the user requests.
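To give a taste of those options, here is a sketch that uses the decorator's order parameter and the repr and compare parameters of field. Leaving price out of the printed form and out of comparisons is an arbitrary choice I've made for the example:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Book:
    title: str
    author: str
    # price is neither printed nor considered in comparisons
    price: float = field(default=20, repr=False, compare=False)

b1 = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)
b2 = Book('MyTitle2', 'AuthorFirst AuthorLast', 25)
b3 = Book('MyTitle1', 'AuthorFirst AuthorLast', 99)

print(b1)         # Book(title='MyTitle1', author='AuthorFirst AuthorLast')
print(b1 == b3)   # True: price is ignored
print(b1 < b2)    # True: compared by title, then author
print(sorted([b2, b1])[0].title)   # MyTitle1
```

With order=True, the decorator also writes __lt__, __le__, __gt__ and __ge__, so instances can be sorted directly.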

Conclusion

Python's classes always have suffered from some repetition, and dataclasses aim to fix that problem. But dataclasses go beyond macros to provide a toolkit that a large number of Python developers can and should use to improve the readability of their code. The fact that dataclasses integrate so nicely into other modern Python tools and code, such as MyPy, tells me that they're going to become the standard way to create and work with classes in Python very quickly.

Resources

Dataclasses are described most fully in PEP (Python Enhancement Proposal) 557. If Python 3.7 isn't out by the time you read this article, you can go to https://python.org and download a beta copy. Although you shouldn't use it in production, you definitely should feel comfortable trying it out and using it for personal projects.

About the Author

Reuven Lerner teaches Python, data science and Git to companies around the world. His free, weekly "better developers" email list reaches thousands of developers each week. Reuven lives with his wife and children in Modi'in, Israel.
