At the Forge

Unicode

Reuven M. Lerner

Issue #230, June 2013

Support legacy data and users better by understanding Unicode.

Let's give credit where credit's due: Unicode is a brilliant invention that makes life easier for millions—even billions—of people on our planet. At the same time, dealing with Unicode, as well as the various encoding systems that preceded it, can be an incredibly painful and frustrating experience. I've been dealing with some Unicode-related frustrations of my own in recent days, so I thought this might be a good time to revisit a topic that every modern software developer, and especially every Web developer, should understand.

In case you don't know what Unicode is, or how it affects you, consider this: in C and in older versions of languages like Python and Ruby, a string is nothing more than a bunch of bytes. There's no rhyme or reason to it; you can read whatever data you want into a string, and the language will be fine with it. For example, if I fire up IPython (running Python 2.7), I can read a JPEG image into a string:

s = open('Downloads/test.jpg').read()

Most of the time, you use strings not to hold JPEG images, but rather to hold text. If your text is all in English, you're in luck, because all the characters used by the English language are defined in ASCII, a standard that defines 128 different characters, each with a unique number. Thus, character 65 is uppercase A, and the space character is number 32. ASCII is great, and it works just fine—until you want to start using languages other than English.
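
You can check those numbers from any Python prompt:

>>> ord('A')
65
>>> chr(32)
' '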

The problem is that most languages require characters that aren't used in English and aren't defined in ASCII. This means that if you want to write words in French, let alone in Arabic or Chinese, ASCII gives you no way to represent them.

A solution for alphabetic languages was a set of ISO standards (ISO 8859-*), which took advantage of the fact that ASCII defines only 128 characters, using 7 bits, while computers store and transmit data in 8-bit bytes. Using all 8 bits doubles the number of available characters, from 128 to 256, which is enough for most languages with a defined alphabet. Thus, Western European languages were covered by ISO-8859-1, Hebrew by ISO-8859-8 and so forth. Moreover, these ISO standards were designed to make it possible to mix the “foreign” language with English. Thus, you could have a document with English and French, or English and Arabic. ASCII characters retained their original values, and the non-ASCII characters were defined in the upper 128.

But, what happens when you want to have a document that contains English, Arabic and French? In the ISO-8859 family of standards, there wasn't any way to accomplish this. The same number that was used to describe an accented character in French would also be used to describe a character in Arabic. The program displaying the text was responsible for deciding which language, and thus which characters, would be displayed. A document written in Russian (ISO-8859-5) but displayed by a program expecting Hebrew (ISO-8859-8) would show Hebrew characters, or rather, gibberish.
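
You can see the ambiguity directly from Python 3, which still ships with codecs for these older standards. The single byte 0xE9 decodes to three different characters, depending on which table you ask it to use:

>>> b'\xe9'.decode('iso-8859-1')
'é'
>>> b'\xe9'.decode('iso-8859-8')
'י'
>>> b'\xe9'.decode('iso-8859-5')
'щ'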

Things are even worse if you're working with non-alphabetic languages, such as Chinese. Even if you dedicated all 128 upper values to Chinese, you could represent only a tiny fraction of the characters the language requires. Clearly, something else was necessary, and indeed, Chinese and Japanese developers invented their own multi-byte systems for storing text on computers, none of which was compatible with the others (or with the ISO-8859 family).

Unicode was designed to solve all of these problems. Simply put, it gives every character in every human writing system its own unique number, or “code point”. Doing this removes the ambiguity associated with displaying text. So long as a program supports Unicode, it doesn't need to know which language family is being used. English, French, Arabic and Russian all can coexist on the same page, without any interference between the characters. Moreover, Unicode supports a very large number of code points, allowing Chinese and Japanese characters to coexist with alphabetic characters.
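
Every one of those characters has its own code point, and they all can live in a single string. In Python 3, ord() reports the code point:

>>> [hex(ord(c)) for c in 'aéא中']
['0x61', '0xe9', '0x5d0', '0x4e2d']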

Encodings

So far, so good. But, switching over to this new system raised two questions. First, how do you take these individual code points, uniquely identifying just about every character humans have created, and translate them into bytes? Second, what happens to existing documents, which weren't written in Unicode?

On the one hand, the answers to those questions are relatively straightforward. On the other hand, the answers lead to much of the frustration associated with using Unicode—not because Unicode itself is bad or difficult, but because the mix of different, existing encodings with a Unicode-based system can be frustrating.

The first question, how do you encode the various Unicode characters using bytes, has multiple answers. If you're using a Unicode-aware language, you no longer can think of characters as being equivalent to bytes. Rather, one character might be a single byte, but it also might be multiple bytes. In the UTF-32 encoding scheme (also known as UCS-4), for example, each Unicode character uses exactly 4 bytes. This provides enough space for every defined Unicode character, which is a good thing, but it also breaks backward compatibility with ASCII documents and quadruples the size of anything written in ASCII or any of the ISO-8859 series.
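
You can see that cost directly in Python 3. (I use the utf-32-be codec here, the big-endian variant, because the plain utf-32 codec also prepends a byte-order mark.) Every character, even plain ASCII, occupies 4 bytes:

>>> 'A'.encode('utf-32-be')
b'\x00\x00\x00A'
>>> len('hello'.encode('utf-32-be'))
20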

For these reasons, the de facto standard in the Unicode world is UTF-8, a variable-length encoding scheme invented by famed programmers Rob Pike and Ken Thompson. The basic idea is that all ASCII characters, from 0–127, remain exactly as they were. For anything else, the high bits of the first byte indicate how many bytes the character occupies, and each continuation byte carries its own high-bit marker, so a program reading the stream always knows whether it's in the middle of a character. In this way, a UTF-8 character consumes as little as one byte (for ASCII characters) and as many as 4 bytes (for characters outside the Basic Multilingual Plane). Most Chinese and Japanese characters require 3 bytes each.
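
Again in Python 3, you can watch the encoded length grow with the character. (The last example, a musical G clef, lives outside the Basic Multilingual Plane.)

>>> [len(c.encode('utf-8')) for c in 'Aé中𝄞']
[1, 2, 3, 4]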

UTF-8 provides the best of all possible worlds—ASCII documents remain as they were, alphabetic languages don't use many more bytes than necessary, the old ambiguity between character sets disappears, and you can represent every Unicode character. But, it does introduce a new problem: strings now can be invalid! In the fixed-width UTF-32 scheme, just about every 4-byte sequence corresponds to a valid character. In UTF-8, by contrast, it's easy to end up with a sequence of bytes that's simply illegal according to the encoding scheme.

To return to my example from earlier in this article, let's say I execute the following code in Python 3, rather than Python 2.7:

s = open('Downloads/test.jpg').read()

Now, in Python 2.7, strings are just collections of bytes. If I want to use Unicode, I need to use a “Unicode string”, a separate type (unicode) whose elements are Unicode characters rather than raw bytes. In Python 3, every string is a Unicode string, and files opened in text mode are decoded as they're read, using UTF-8 by default on most systems. This means that executing the above code actually will result in an exception:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in
position 0: invalid start byte

In other words, Python was expecting input in UTF-8, but it noticed the byte 0xFF at the start of the file, a byte that never appears in valid UTF-8. What you need to do is tell Python that you want to read the file in binary format, by opening it in “read binary” mode:

s = open('/Users/reuven/Downloads/test.jpg', mode='rb').read()

Now, given that you've read the file in binary mode, you're treating it as bytes, rather than a string. And sure enough, if you ask Python what type of data was returned:


>>> type(s)
<class 'bytes'>

In other words, Python won't create an illegal string. Instead, read() returns a bytes object (a “bytestring”), which is roughly equivalent to a Python 2.x string.

That covers files written in UTF-8. But what about files written in another encoding, such as ISO-8859-5? In such a case, you need to pass another parameter to “open”, indicating the encoding Python should use when decoding the file.
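
For example, to read a file you believe to be in ISO-8859-5 (the filename here is just a placeholder), you would write:

s = open('russian-legacy.txt', encoding='iso-8859-5').read()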

Ruby has undergone a similar change in the past few years. Ruby 1.8 saw strings as collections of bytes, and it really didn't think or care much about Unicode and other encodings. Ruby 1.9 (as well as 2.0) shifted in a direction similar to Python's, such that every string has an encoding associated with it. Unlike Python, however, you can read binary data into a Ruby 2.0 string, and the language will be fine with that:

s = File.read('Downloads/test.jpg')

If you ask Ruby what sort of object was returned, it'll tell you it's a String, one that claims to be UTF-8 but whose contents aren't actually valid in that encoding:


>> s.class #=> String
>> s.encoding #=> #<Encoding:UTF-8>
>> s.valid_encoding? #=> false

But, you then can force the encoding to something else, in this case ASCII-8BIT, Ruby's name for plain binary data:


>> s.force_encoding(Encoding.find('ASCII-8BIT'))
>> s.encoding #=> #<Encoding:ASCII-8BIT>
>> s.valid_encoding? #=> true

Web Development and Unicode

All of this is well and good, but how does it affect Web developers? Again, none of this would be a problem if you could magically flick a switch and have all documents and computers switch to UTF-8. But, that's far from the case. Not only are there many documents out there written in non-UTF-8 encodings, but there are also many computers whose default encoding still isn't UTF-8.

This means if you have an HTML form that accepts input from users' browsers, you likely will receive that input in whatever encoding their systems are using. True, most modern computers and browsers use UTF-8, but you would be amazed by how many old systems exist. You should experiment with your Web application, ensuring that even when someone sends you data in a non-Unicode encoding, you still can handle it (or gracefully deal with the failure).
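
What that handling looks like will vary, but here is a minimal sketch of one defensive approach; the function name and the fallback encoding are purely my own assumptions. Try UTF-8 first, and fall back to a legacy encoding only if that fails:

def decode_form_field(raw_bytes, fallback='iso-8859-1'):
    # raw_bytes is whatever arrived from the browser, as bytes.
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8; assume (or guess) a legacy encoding instead.
        return raw_bytes.decode(fallback)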

Another issue I encountered recently involved not direct user input, but files that users were uploading. My Web application worked in UTF-8, and everything seemed to be humming along—until it wasn't. Part of the application involved people uploading text files; I would read the contents of each file into a string and then store that string in a database. Unfortunately, the application would raise an exception, because the text files—coming from people around the world, in different languages and using many different encodings—often weren't valid UTF-8. One solution would have been to try to identify the encoding of each uploaded file. In my particular case, I was able to catch the exception and report it to the user, indicating that only files in UTF-8 were acceptable. Whether such an error message will suffice for your application depends on what you're doing.
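
In sketch form (this isn't the actual application code, and the function name is mine), the catch-and-report approach looks something like this:

def read_uploaded_text(path):
    raw = open(path, mode='rb').read()   # read the raw bytes of the upload
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Turn the low-level failure into a message the user can act on.
        raise ValueError('Only UTF-8 text files are accepted.')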

And yes, that leads me to my next point, namely databases. All of the major relational and NoSQL databases with which I work support UTF-8 as a default. PostgreSQL, for example, gives each database an encoding, indicating the encoding that will be used in text columns. The good news is that this ensures that all text stored in the database will be valid UTF-8, or whatever other encoding you use. The bad news (to some degree) is that if you want to store both binary and textual data in the same column, you'll have to find another solution. Binary data, such as the contents of a JPEG file, cannot be stored in a text column, because it's not legal UTF-8. Instead, you'll need to store such information in a binary BYTEA column, which accepts any sequence of bytes and doesn't attempt to ensure its validity. Fortunately, the drivers with which I work understand the difference between TEXT and BYTEA columns and return results using appropriate data types.
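
Here's a small sketch using psycopg2, assuming a hypothetical “uploads” table with a TEXT column and a BYTEA column; the connection string and column names are mine, not anything your schema requires:

import psycopg2

conn = psycopg2.connect('dbname=test')
cur = conn.cursor()

description = 'vacation photo'                              # str goes to TEXT
jpeg_bytes = open('Downloads/test.jpg', mode='rb').read()   # bytes go to BYTEA

cur.execute('INSERT INTO uploads (description, body) VALUES (%s, %s)',
            (description, psycopg2.Binary(jpeg_bytes)))
conn.commit()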

Realize that there is a difference, however, between encoding and collation. Encoding refers to the way characters are translated into a series of bytes; UTF-8 is one such encoding. Collation refers to how text is sorted and, thus, is language-dependent. Consider that sorting a list of 100 words will produce different results in English, Spanish and French, and you'll understand that your application's needs (and users) will determine, to a large degree, which collation, if any, you choose to use.
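
Python's locale module can illustrate the difference; this sketch assumes the fr_FR.UTF-8 locale is installed on your system, which may not be the case:

import locale

words = ['cote', 'côte', 'coté', 'côté']

locale.setlocale(locale.LC_COLLATE, 'fr_FR.UTF-8')
print(sorted(words, key=locale.strxfrm))   # sorted using French collation rules

locale.setlocale(locale.LC_COLLATE, 'C')
print(sorted(words, key=locale.strxfrm))   # sorted by raw code-point order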

Conclusion

Just about ten years ago, I worked on a multilingual site that required Unicode, and my decision to use it caused a great deal of friction with others working on the project, because they didn't have editors that supported UTF-8.

Things are quite different today. Just about every piece of Web-related software supports Unicode, from the operating system and language to the database and browser. However, the numerous non-Unicode computers, programs and files still out there require that you keep them in mind and be able to work with them. Moreover, working with binary files and data means you need to get out of the mindset that “everything can be a string”, because modern strings are picky about the data they will let you store.

Understanding Unicode is essential to knowing how modern Web applications work. Once you've made sure your application is using the right methods and checking the data in the right places, it'll work just fine with users from around the world.

Web developer, trainer and consultant Reuven M. Lerner is finishing his PhD in Learning Sciences at Northwestern University. He lives in Modi'in, Israel, with his wife and three children. You can read more about him at lerner.co.il, or contact him at reuven@lerner.co.il.