Thinking NoSQL? Believe it or not, PostgreSQL might be a great choice.
One of the most interesting trends in the computer world during the past few years has been the rapid growth of NoSQL databases. The term may be accurate, in that NoSQL databases don't use SQL in order to store and retrieve data, but that's about where the commonalities end. NoSQL databases range from key-value stores to columnar databases to document databases to graph databases.
On the face of it, nothing sounds more natural or reasonable than a NoSQL database. The “impedance mismatch” between programming languages and databases, as it often is described, means that we generally must work in two different languages, and in two different paradigms. In our programs, we think and work with objects, which we carefully construct. And then we deconstruct those objects, turning them into two-dimensional tables in our database. The idea that I can manipulate objects in my database in the same way as I can in my program is attractive at many levels.
In some ways, this is the holy grail of databases: we want something that is rock-solid reliable, scalable to the large proportions that modern Web applications require and also convenient to us as programmers. One popular solution is an ORM (object-relational mapper), which allows us to write our programs using objects. The ORM then translates those objects and method calls into the appropriate SQL, which it passes along to the database. ORMs certainly make it more convenient to work with a relational database, at least when it comes to simple queries. And to no small degree, they also improve the readability of our code, in that we can stick with our objects, without having to use a combination of languages and paradigms.
But ORMs have their problems as well, in no small part because they can shield us from the inner workings of our database. NoSQL advocates say that their databases have solved these problems, allowing them to stay within a single language. Actually, this isn't entirely true. MongoDB has its own JSON-based query language, and CouchDB uses JavaScript. But there are adapters that do similar ORM-like translations for many NoSQL databases, allowing developers to stay within a single language and paradigm when developing.
The ultimate question, however, is whether the benefits of NoSQL databases outweigh their issues. I have largely come to the conclusion that, with the exception of key-value stores, the answer is “no”—that a relational database often is going to be a better solution. And by “better”, I mean that relational databases are more reliable, and even more scalable, than many of their NoSQL cousins. Sure, you might need to work hard in order to get the scaling to work correctly, but there is no magic solution. In the past few months alone, I've gained several new clients who decided to move from NoSQL solutions to relational databases, and needed help with the architecture, development or optimization.
The thing is, even the most die-hard relational database fan will admit there are times when NoSQL data stores are convenient. With the growth of JSON in Web APIs, it would be nice to be able to store the result sets in a storage type that understands that format and allows me to search and retrieve from it. And even though key-value stores, such as Redis, are powerful and fast, there are cases when I'd like to have the key-value pairs connected to data in other relations (tables) in my database.
If this describes your dilemma, I have good news for you. As I write this, PostgreSQL, an amazing database and open-source project, is set to release version 9.4. This new version, like all other PostgreSQL versions, contains a number of optimizations, improvements and usability features. But two of the most intriguing features to me are HStore and JSONB, features that actually turn PostgreSQL into a NoSQL database.
Fine, perhaps I'm exaggerating a bit here. PostgreSQL was and always will be relational and transactional, and adding these new data types hasn't changed that. But having a key-value store within PostgreSQL opens many new possibilities for developers. JSONB, a binary version of JSON storage that supports indexing and a large number of operators, turns PostgreSQL into a document database, albeit one with a few other features in it besides.
In this article, I introduce these NoSQL features that are included in PostgreSQL 9.4, which likely will be released before this issue of Linux Journal gets to you. Although not every application needs these features, they can be useful—and with this latest release of PostgreSQL, the performance also is significantly improved.
One of the most interesting new developments in PostgreSQL is that of HStore, which provides a key-value store within the PostgreSQL environment. Contrary to what I originally thought, this doesn't mean that PostgreSQL treats a particular table as a key-value store. Rather, HStore is a data type, akin to INTEGER, TEXT and XML. Thus, any column—or set of columns—within a table may be defined to be of type HSTORE. For example:
CREATE TABLE People (
    id SERIAL,
    info HSTORE,
    PRIMARY KEY(id)
);
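One quick note: HStore ships as a contrib extension, rather than as part of the core server. If PostgreSQL complains that the type "hstore" does not exist when you try the above, you (or your DBA) first need to enable the extension in your database:

CREATE EXTENSION IF NOT EXISTS hstore;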
Once I have done that, I can ask PostgreSQL to show me the definition of the table:
\d people

                         Table "public.people"
 Column |  Type   |                      Modifiers
--------+---------+------------------------------------------------------
 id     | integer | not null default nextval('people_id_seq'::regclass)
 info   | hstore  |
Indexes:
    "people_pkey" PRIMARY KEY, btree (id)
As you can see, the type of my “info” column is hstore. What I have effectively created is a (database) table of hash tables. Each row in the “people” table will have its own hash table, with any keys and values. It's typical in such a situation for every row to have the same key names, or at least some minimum number of overlapping key names, but you can, of course, use any keys and values you like.
Both the keys and the values in an HStore column are text strings. You can assign a hash table to an HStore column with the following syntax:
INSERT INTO people(info) VALUES ('foo=>1, bar=>abc, baz=>stuff');
Notice that although this example inserts three key-value pairs into the HStore column, they are stored together as a single HStore value. PostgreSQL converts the string automatically, splitting it into pairs wherever there is a comma, and splitting each pair into a key and a value wherever there is a => sign.
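If you're curious about what was actually stored, you can simply select the column back out again. On my system, the output looks roughly like this (hstore prints each key and value in quotes, and the order of the pairs may differ from the order in which you wrote them):

SELECT info FROM People;

                   info
-------------------------------------------
 "bar"=>"abc", "baz"=>"stuff", "foo"=>"1"
(1 row)

Aside from that quoting, the value looks much like the text you put in.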
So far, you won't see any difference between an HStore and a TEXT column, other than (perhaps) the fact that you cannot use text functions and operators on that column. For example, you cannot use the || operator, which normally concatenates text strings, on the HStore:
UPDATE People SET info = info || 'abc';

ERROR:  XX000: Unexpected end of string
LINE 1: UPDATE People SET info = info || 'abc';
                                         ^
PostgreSQL tries to apply the || operator to the HStore on the left, but cannot find a key-value pair in the string on the right, producing an error message. However, you can add a pair, which will work:
UPDATE People SET info = info || 'abc=>def';
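You also can modify existing pairs in place. Concatenating a pair whose key is already present replaces the old value, and the - operator removes a key entirely. A quick sketch, reusing the same table (the 'abc' key is the one added just above):

-- Replaces the value stored under the 'abc' key
UPDATE People SET info = info || 'abc=>xyz';

-- Removes the 'abc' key (and its value) altogether
UPDATE People SET info = info - 'abc'::text;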
As with all hash tables, HStore is designed for you to use the keys to retrieve the values. That is, each key exists only once in each HStore value, although values may be repeated. The only way to retrieve a value is via the key. You do this with the following syntax:
SELECT info->'bar' FROM People;

 ?column?
----------
 abc
(1 row)
Notice several things here. First, the name of the column appears without any quotes, just as it does when you retrieve the full contents of the column. Second, you put the name of the key after the -> arrow, which is different from the => (“hashrocket”) arrow used to delineate key-value pairs within the HStore. Finally, the returned value always will be of type TEXT. This means if you say:
SELECT info->'foo' || 'a' FROM People;

 ?column?
----------
 1a
(1 row)
Notice that ||, which works on text values, has done its job here. However, this also means that if you try to multiply your value, you will get an error message:
SELECT info->'foo' * 5 FROM People;

ERROR:  42883: operator does not exist: text * integer
LINE 1: SELECT info->'foo' * 5 FROM People;
                           ^
HINT:  No operator matches the given name and argument type(s).
You might need to add explicit type casts.
Time: 5.041 ms
If you want to retrieve info->'foo' as an integer, you must cast that value:
SELECT (info->'foo')::integer * 5 FROM People;

 ?column?
----------
        5
(1 row)
Now, why is HStore so exciting? In particular, if you're a database person who values normalization, you might be wondering why someone even would want this sort of data store, rather than a nicely normalized table or set of tables.
The answer, of course, is that there are many different uses for a database, and some of them can be more appropriate for an HStore. I never would suggest storing serious data in such a thing, but perhaps you want to keep track of user session information, without keeping it inside of a binary object.
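For example, a sessions table along the following lines (the table and key names here are purely illustrative) lets each session carry whatever attributes happen to be relevant, without requiring a schema change every time you want to track something new:

CREATE TABLE Sessions (
    id SERIAL PRIMARY KEY,
    started_at TIMESTAMP DEFAULT now(),
    data HSTORE
);

INSERT INTO Sessions (data)
    VALUES ('browser=>Firefox, locale=>en_US, dark_mode=>true');

-- The ? operator checks whether a given key is present
SELECT id, data->'locale' FROM Sessions WHERE data ? 'dark_mode';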
Now, HStore is not new to PostgreSQL. The big news in version 9.4 is that GIN and GiST indexes now support HStore columns, and that they do so with great efficiency and speed.
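Creating such an index takes a single statement. Here is a minimal sketch, using the "people" table from above; the @> ("contains") operator in the query is one of the HStore operators that a GIN index can accelerate:

CREATE INDEX people_info_idx ON People USING GIN (info);

-- With enough rows, this containment query can use the GIN index
-- instead of scanning the whole table:
SELECT id FROM People WHERE info @> 'foo=>1';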
Where do I plan to use HStore? To be honest, I'm not sure yet. I feel like this is a data type that I likely will want to use at some point, but for now, it's simply an extra useful, efficient tool that I can put in my programming toolbox. The fact that it is now extremely efficient, and its operators can take advantage of improved indexes, means that HStore is not only convenient, but speedy, as well.
It has long been possible to store JSON inside PostgreSQL. After all, JSON (“JavaScript Object Notation”) is just a textual representation of JavaScript objects, which means that a JSON document is effectively a string. But of course, when you store data in PostgreSQL, you would like a bit more than that. You want to ensure that stored data is valid, as well as use PostgreSQL's operators to retrieve and work on that data.
PostgreSQL has had a JSON data type for several years. The data type started as a simple textual representation of JSON, which would check for valid contents, but not much more than that. The 9.3 release of PostgreSQL allowed you to use a larger number of operators on your JSON columns, making it possible to retrieve particular parts of the data with relative ease.
However, the storage and retrieval of JSON data was never that efficient, and the JSON-related operators were particularly bad on this front. So yes, you could look for a particular name or value within a JSON column, but it might take a while.
That has changed with 9.4, with the introduction of the JSONB data type, which stores JSON data in binary form, such that it is both more compact and more efficient than the textual form. Moreover, the same GIN and GiST indexes that now work so well with HStore data also work well, and quickly, with JSONB data. So you can search for and retrieve text from JSONB documents as easily as (or more easily than) you could with a document database, such as MongoDB.
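As a small sketch of what that looks like in practice (the table and document here are invented for illustration), you can define a JSONB column, index it with GIN and then run containment queries against it:

CREATE TABLE Documents (
    id SERIAL PRIMARY KEY,
    doc JSONB
);

CREATE INDEX documents_doc_idx ON Documents USING GIN (doc);

INSERT INTO Documents (doc)
    VALUES ('{"surname": "Lerner", "languages": ["Python", "Ruby"]}');

-- The @> ("contains") operator can take advantage of the GIN index:
SELECT id FROM Documents WHERE doc @> '{"surname": "Lerner"}';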
I already have started to use JSONB in some of my work. For example, one of the projects I'm working on contacts a remote server via an API. The server returns its response in JSON, containing a large number of name-value pairs, some of them nested. (I should note that using a beta version of PostgreSQL, or any other infrastructural technology, is only a good idea if you first get the client's approval, and explain the risks and benefits.)
Now, I'm a big fan of normalized data. And I'm not a huge fan of storing JSON in the database. But rather than start to guess what data I will and won't need in the future, I decided to store everything in a JSONB column for now. If and when I know precisely what I'll need, I will normalize the data to a greater degree.
Actually, that's not entirely true. I knew from the start that I would need two different values from the response I was receiving. But because I was storing the data in JSONB, I figured it would make sense for me simply to retrieve the data from the JSONB column.
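To give a sense of what this looks like, the table was something along these lines (a simplified sketch, not the exact definition), with the raw API response going straight into a JSONB column:

CREATE TABLE ID_Checks (
    id SERIAL PRIMARY KEY,
    email TEXT,
    personal_data JSONB,
    created_at TIMESTAMP DEFAULT now()
);

INSERT INTO ID_Checks (email, personal_data)
    VALUES ('someone@example.com',
            '{"surname": "Lerner", "forename": "Reuven"}');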
Having stored the data there, I then could retrieve data from the JSONB column:
SELECT id, email,
       personal_data->>'surname' AS surname,
       personal_data->>'forename' AS given_name
  FROM ID_Checks
 WHERE personal_data->>'surname' ILIKE '%lerner%';
Using the double-arrow operator (->>), I was able to retrieve the value of a JSON object by using its key. Note that if you use a single arrow (->), you'll get an object back, which is quite possibly not what you want. I've found that the text portion is really what interests me most of the time.
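If you're ever unsure which of the two you're dealing with, you can ask PostgreSQL directly. In this query against the ID_Checks table, pg_typeof() shows that -> hands back jsonb, while ->> hands back text:

SELECT pg_typeof(personal_data->'surname') AS single_arrow,
       pg_typeof(personal_data->>'surname') AS double_arrow
  FROM ID_Checks LIMIT 1;

 single_arrow | double_arrow
--------------+--------------
 jsonb        | text
(1 row)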
People use NoSQL databases for several reasons. One is the impedance mismatch between objects and tables. But two other common reasons are performance and convenience. It turns out that modern versions of PostgreSQL offer excellent performance, thanks to improved data types and indexes. But they also offer a great deal of convenience, letting you set, retrieve and delete JSON and key-value data easily, efficiently and naturally.
I'm not going to dismiss the entire NoSQL movement out of hand. But I will say that the next time you're thinking of using a NoSQL database, consider using one that can already fulfill all of your needs, and which you might well be using already—PostgreSQL.