At the Forge

PostgreSQL 9.5

Reuven M. Lerner

Issue #265, May 2016

Reuven describes some of the most interesting features of PostgreSQL 9.5.

Longtime readers of this column know that although NoSQL databases certainly have their place, I'm definitely a fan of relational (SQL) databases. And when choosing a relational database, my hands-down favorite is PostgreSQL. I've been using it for many years, starting soon after it was released, and although those early versions of PostgreSQL were packed with features, there also were significant limitations in nearly every aspect.

Today, however, PostgreSQL is recognized not just as a sophisticated database, but as a platform that can be used for storing and retrieving data, for working with data on many types of remote systems, and for manipulating and analyzing that data. Just before sitting down to write this article, I finished teaching a course in PostgreSQL to a high-tech company—and throughout the course, even people who had been working with PostgreSQL for years said, “I didn't know that it could do that too!”

PostgreSQL 9.5, which was released earlier this year, might not have earth-shattering features, but it continues in a long tradition of carefully managed, rock-solid releases, combining new functionality with improved performance.

I've recently recommended that my clients start to upgrade to PostgreSQL 9.5. Along the way, I've been looking into some of the improvements and additions that were contained in this release. In this article, I review some of the more interesting ones and describe how they can be used. At the end of the day, a database is designed to help you extract useful information easily and quickly; most of these features fit precisely into that category.

UPSERT

Perhaps the most talked-about feature in PostgreSQL 9.5 is what's known as “UPSERT” in database circles. The basic idea is that you should be able to INSERT a new record into a table, but if the new record would collide with an existing one, then you UPDATE the existing row instead.

For example, let's assume you have the following table:

CREATE TABLE People (
    id           SERIAL PRIMARY KEY,
    ssn          TEXT NOT NULL,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,

    UNIQUE(ssn)
);

The above table makes the id column the primary key, using PostgreSQL's SERIAL pseudo-type to create a sequence and ensure that the column auto-increments when you don't set it explicitly. But, say you also want to keep track of people's Social Security Numbers (on the assumption that everyone in the database is from the United States). Those also are supposed to be unique and, thus, get a UNIQUE constraint. Then you have the person's first and last names.

If you try to INSERT a new person into this table, it should work fine:

INSERT INTO People (ssn, first_name, last_name)
    VALUES ('123-456-7890', 'John', 'Smith');

If you try to INSERT the same row again, you'll get an error. Even if the name has changed, as long as the SSN is the same, the INSERT will fail. But of course, that's the whole idea of having the UNIQUE clause in the table definition; you want to ensure that no other row can have the same SSN.

But, there likely will be many cases when you basically want to say to the database, “If this is a new SSN, then insert the new row. But if someone with this SSN already exists, then update their record to reflect the new name.”

You can do this by modifying the INSERT statement to include an ON CONFLICT clause. This clause can work in a few ways. First, it can indicate that you silently want to ignore such conflicts:

INSERT INTO People (ssn, first_name, last_name)
    VALUES ('123-456-7890', 'Richard', 'Roe')
ON CONFLICT DO NOTHING;

The above query basically tells PostgreSQL, “Insert this row if you can. If not, don't worry about it.”

A more common action, and the source of the “UPSERT” term, is to say “ON CONFLICT DO UPDATE”. Unlike DO NOTHING, DO UPDATE requires you to name the column (or constraint) whose uniqueness would be violated, and then to provide an UPDATE-like SET clause:

INSERT INTO People (ssn, first_name, last_name)
    VALUES ('123-456-7890', 'Richard', 'Roe')
ON CONFLICT (ssn) DO
    UPDATE SET first_name = 'Richard', last_name = 'Roe';

If you have assigned a default value to one of the columns in question, you even can say:

SET colname = DEFAULT

and the column will be set to its default value.

If you want to set a value based on an existing one, you can do that by qualifying the column name: the table name (as in People.first_name) refers to the row already in the table, and the special EXCLUDED alias refers to the row you proposed to insert.
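
For example, here's a sketch (against the same People table) that takes whatever name arrives in the proposed row, rather than hard-coding it in the SET clause:

INSERT INTO People (ssn, first_name, last_name)
    VALUES ('123-456-7890', 'Jane', 'Doe')
ON CONFLICT (ssn) DO
    UPDATE SET first_name = EXCLUDED.first_name,
               last_name = EXCLUDED.last_name;

With this version, a single statement handles both brand-new people and name changes for people already in the table.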

Note that UPSERT is one of those things you should think about very carefully. Do you really want to replace an existing record with a new one? Or, do you want to enforce uniqueness and, thus, get an error if you try to INSERT a new row whose unique values clash with those already in the database? That's a question only you can answer, and it's the reason you have to add the ON CONFLICT clause to your INSERT statement explicitly.

That said, this functionality, used appropriately, allows you to shorten your queries and make your logic clearer.

Better Grouping

One of the first things you learn in SQL is to count things. For example:

SELECT COUNT(*) FROM People;

That'll tell you how many people you have overall. But in most cases, you want to break that apart, finding out how many people of various sorts you have. I'm going to drop the earlier People table, create a new one and populate it based on a recently released movie:

CREATE TABLE People (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    species TEXT NOT NULL,
    gender CHAR(1) NOT NULL,
    side TEXT NOT NULL
);

INSERT INTO People (name, species, gender, side)
    VALUES ('Luke', 'Human', 'M', 'Good'),
           ('Leia', 'Human', 'F', 'Good'),
           ('Han', 'Human', 'M', 'Good'),
           ('Chewbacca', 'Human', 'M', 'Good'),
           ('Kylo', 'Human', 'M', 'Evil'),
           ('Phasma', 'Human', 'F', 'Evil'),
           ('Rey', 'Human', 'F', 'Good'),
           ('Finn', 'Human', 'M', 'Good'),
           ('R2D2', 'Droid', 'D', 'Good'),
           ('C3P0','Droid', 'D', 'Good'),
           ('BB8', 'Droid', 'D', 'Good')
           ;

You can find the gender breakdown pretty easily, by using GROUP BY:

SELECT gender, COUNT(*) FROM People GROUP BY gender;
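
With the data above, that should produce something like this (row order isn't guaranteed without an ORDER BY):

+--------+-------+
| gender | count |
+--------+-------+
| M      |     5 |
| F      |     3 |
| D      |     3 |
+--------+-------+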

But, what if you're interested in finding out the breakdown by several factors, not just gender? You could issue multiple queries. With PostgreSQL 9.5, you also can use GROUPING SETS. This is used with GROUP BY and lets you pass a list of groupings; PostgreSQL then calculates the aggregate separately for each grouping in the list. Thus, you can write:

SELECT species, gender, side, COUNT(*)
  FROM People
GROUP BY GROUPING SETS (species, gender, side);

The result is a table, showing how many humans vs. droids, male vs. female vs. droid, and good vs. evil are in the movie:

+---------+--------+------+-------+
| species | gender | side | count |
+---------+--------+------+-------+
| Droid   |        |      |     3 |
| Human   |        |      |     8 |
|         |        | Evil |     2 |
|         |        | Good |     9 |
|         | D      |      |     3 |
|         | F      |      |     3 |
|         | M      |      |     5 |
+---------+--------+------+-------+

If you want to get a count of all records, you can add an empty set of parentheses to your call to GROUPING SETS:

SELECT species, gender, side, COUNT(*)
  FROM People
GROUP BY GROUPING SETS (species, gender, side, ());

Now you'll get an additional row, indicating how many total rows are in the table. However, that row will have NULL values in each column, so if your other columns contain NULL values, you might have some ambiguity problems.
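
If that's a concern, PostgreSQL 9.5 also added the GROUPING function, which returns an integer bit mask indicating which of its arguments were not part of the grouping that produced a given row. Here's a quick sketch: the grand-total row gets a mask of 7 (all three bits set), while a row grouped only by species gets 3, letting you tell a subtotal NULL apart from a real NULL:

SELECT species, gender, side,
       GROUPING(species, gender, side) AS mask,
       COUNT(*)
  FROM People
GROUP BY GROUPING SETS (species, gender, side, ());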

GROUPING SETS also can take multicolumn sets, with each parenthesized set producing its own group of rows:

SELECT species, gender, side, COUNT(*)
    FROM People
GROUP BY GROUPING SETS ((species, gender),
                        (species, gender, side));
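
With the data above, that should produce one batch of rows for each set (again, row order may vary):

+---------+--------+------+-------+
| species | gender | side | count |
+---------+--------+------+-------+
| Droid   | D      |      |     3 |
| Human   | F      |      |     3 |
| Human   | M      |      |     5 |
| Droid   | D      | Good |     3 |
| Human   | F      | Evil |     1 |
| Human   | F      | Good |     2 |
| Human   | M      | Evil |     1 |
| Human   | M      | Good |     4 |
+---------+--------+------+-------+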

What if you don't want to see the numbers from individual columns, but combinations? That is, maybe you want to find how many good females there are or how many male humans. If you're running an on-line advertising system, you might want to provide your users (à la Facebook's ad manager) with the ability to break down advertising by country, gender, interests or something else. For such cases, PostgreSQL 9.5 provides the CUBE facility, which expands into grouping sets covering every possible combination of the listed columns:

SELECT species, gender, side, COUNT(*)
  FROM People
GROUP BY CUBE (species, gender, side)
ORDER BY species, gender, side;

The result contains a row for every combination; a NULL (which sorts last with the ORDER BY above) marks each column that a given row doesn't group by:

+---------+--------+------+-------+
| species | gender | side | count |
+---------+--------+------+-------+
| Droid   | D      | Good |     3 |
| Droid   | D      |      |     3 |
| Droid   |        | Good |     3 |
| Droid   |        |      |     3 |
| Human   | F      | Evil |     1 |
| Human   | F      | Good |     2 |
| Human   | F      |      |     3 |
| Human   | M      | Evil |     1 |
| Human   | M      | Good |     4 |
| Human   | M      |      |     5 |
| Human   |        | Evil |     2 |
| Human   |        | Good |     6 |
| Human   |        |      |     8 |
|         | D      | Good |     3 |
|         | D      |      |     3 |
|         | F      | Evil |     1 |
|         | F      | Good |     2 |
|         | F      |      |     3 |
|         | M      | Evil |     1 |
|         | M      | Good |     4 |
|         | M      |      |     5 |
|         |        | Evil |     2 |
|         |        | Good |     9 |
|         |        |      |    11 |
+---------+--------+------+-------+

You can think of CUBE as providing all of the possible combinations (that is, subsets) of the grouped columns.

A similar type of analysis, known as ROLLUP, also breaks things down in multiple layers, starting with the full list, then all but the final one, then all but the final two, until you get down to an empty list. This is useful when you have a hierarchy—imagine a salary table in which each person's location, division and team are indicated. You then could get total salary (with SUM) or average salary (with AVG) across those different layers.
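
Here's a sketch of that idea; the Salaries table and its columns are made up for illustration:

CREATE TABLE Salaries (
    id       SERIAL PRIMARY KEY,
    location TEXT NOT NULL,
    division TEXT NOT NULL,
    team     TEXT NOT NULL,
    salary   NUMERIC NOT NULL
);

SELECT location, division, team, SUM(salary)
  FROM Salaries
GROUP BY ROLLUP (location, division, team);

This produces subtotals for each (location, division, team), each (location, division) and each location, plus a single grand total at the end.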

These new options to GROUP BY were added to help people using PostgreSQL for their data analysis in the ever-growing world of “big data” and data science. Some commercial databases have offered this functionality for some time, and now it has been added to PostgreSQL as well.

Track Commit Timestamp

This is a small feature, but one that will be welcome in many quarters. For years, Ruby on Rails has added created_at and updated_at columns to every ActiveRecord model, because it's so useful to know when a record was created or last modified. PostgreSQL 9.5 optionally lets you retrieve similar information for the rows in any table, by recording the time at which each transaction is committed.

You must activate this feature in the postgresql.conf configuration file, and if you change its value, you need to restart PostgreSQL in order to see the effects. When activated, the line will look like this:

track_commit_timestamp = on
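
After restarting PostgreSQL, you can confirm from within psql that the setting took effect:

SHOW track_commit_timestamp;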

Now, when you create a new table and add some rows:

CREATE TABLE Stuff (
    id SERIAL PRIMARY KEY,
    thing TEXT NOT NULL
);


INSERT INTO Stuff (thing) values ('a');
INSERT INTO Stuff (thing) values ('b');
INSERT INTO Stuff (thing) values ('c');

A normal SELECT on that table will give the following:

[local]/reuven=# select * from stuff;
+----+-------+
| id | thing |
+----+-------+
|  1 | a     |
|  2 | b     |
|  3 | c     |
+----+-------+
(3 rows)

But, if you use the pg_xact_commit_timestamp function on the normally hidden xmin system column, which holds the ID of the transaction that wrote each row, you can find the timestamp for each row:

[local]/reuven=# SELECT pg_xact_commit_timestamp(xmin), * FROM stuff;
+-------------------------------+----+-------+
|   pg_xact_commit_timestamp    | id | thing |
+-------------------------------+----+-------+
| 2016-03-24 12:07:16.591932+02 |  1 | a     |
| 2016-03-24 12:07:16.592771+02 |  2 | b     |
| 2016-03-24 12:07:16.593563+02 |  3 | c     |
+-------------------------------+----+-------+

Remember that the timestamp belongs to the transaction, not to the individual rows. Thus, if you INSERT several rows in the same transaction, they'll all have the same timestamp:

INSERT INTO Stuff (thing) values ('d'), ('e'), ('f');

[local]/reuven=# SELECT pg_xact_commit_timestamp(xmin), * FROM stuff;
+-------------------------------+----+-------+
|   pg_xact_commit_timestamp    | id | thing |
+-------------------------------+----+-------+
| 2016-03-24 12:07:16.591932+02 |  1 | a     |
| 2016-03-24 12:07:16.592771+02 |  2 | b     |
| 2016-03-24 12:07:16.593563+02 |  3 | c     |
| 2016-03-24 12:10:15.647167+02 |  4 | d     |
| 2016-03-24 12:10:15.647167+02 |  5 | e     |
| 2016-03-24 12:10:15.647167+02 |  6 | f     |
+-------------------------------+----+-------+
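
The same holds for separate statements wrapped in an explicit transaction; everything committed together gets one timestamp:

BEGIN;
INSERT INTO Stuff (thing) values ('g');
INSERT INTO Stuff (thing) values ('h');
COMMIT;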

And Much More

Those are the features I'll be using the most, but PostgreSQL 9.5 includes a lot more than what I have space to describe here. Foreign data wrappers, which allow PostgreSQL to talk to other databases, have gotten much smarter, including the ability to import foreign schemas. JSON operators have become more sophisticated, making PostgreSQL into (ironically) one of the fastest and most fully featured NoSQL databases. New BRIN indexes are a good compromise between speed, accuracy and size. And of course, there are numerous performance improvements as well.
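
To give a flavor of two of those features, here are quick sketches; the server, schema and table names are hypothetical, and the first assumes you already have configured a foreign server (for example, with postgres_fdw):

IMPORT FOREIGN SCHEMA public
    FROM SERVER other_db
    INTO remote_tables;

CREATE INDEX logs_created_at_brin
    ON Logs USING brin (created_at);

The first statement pulls in the definitions of all the tables in the remote schema, and the second builds a small BRIN index suited to large tables whose values correlate with their physical order on disk.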

And although it's still too early to talk about it seriously, there already has been discussion of whether the next version will be called 9.6 or 10.0—in part because it looks like the next version will include some truly killer features. About a year from now, we'll be able to explore those and see just how well they make this amazing database even more amazing.

Conclusion

As I have learned to expect, the latest version of PostgreSQL offers a host of new features and enhancements that make it not just a rock-solid database on which to run your operations, but also one that offers a great deal of flexibility in doing so. If you're already a die-hard PostgreSQL user, you'll probably enjoy these new features and should plan to upgrade soon. And if not, well, then maybe you'll take a look at it!

Reuven M. Lerner offers training in Python, Git and PostgreSQL to companies around the world. He blogs at blog.lerner.co.il, tweets at @reuvenmlerner and curates DailyTechVideo.com. Reuven lives in Modi'in, Israel, with his wife and three children.