An introduction to the basics of Unicode, distilled from several earlier posts. In the interests of presenting the big picture, I have painted with a broad brush — large areas are summarized; nits are not picked; hairs are not split; wind resistance is ignored.
Unicode = one character set, plus several encodings
Unicode is actually not one thing, but two separate and distinct things. The first is a character set and the second is a set of encodings.
- The first — the idea of a character set — has absolutely nothing to do with computers.
- The second — the idea of encodings for the Unicode character set — has everything to do with computers.
The idea of a character set has nothing to do with computers. So let’s suppose that you’re a British linguist living in, say, 1750. The British Empire is expanding and Europeans are discovering many new languages, both living and dead. You’ve known about Chinese characters for a long time, and you’ve just discovered Sumerian cuneiform characters from the Middle East and Sanskrit characters from India.
Trying to deal with this huge mass of different characters, you get a brilliant idea — you will make a numbered list of every character in every language that ever existed.
You start your list with your own familiar set of English characters — the upper- and lower-case letters, the numeric digits, and the various punctuation marks like period (full stop), comma, exclamation mark, and so on. And the space character, of course.
01 a 02 b 03 c ... 26 z 27 A 28 B ... 52 Z 53 0 54 1 55 2 ... 62 9 63 (space) 64 ? (question mark) 65 , (comma) ... and so on ...
Then you add the Spanish, French and German characters with tildes, accents, and umlauts. You add characters from other living languages — Greek, Japanese, Chinese, Korean, Sanscrit, Arabic, Hebrew, and so on. You add characters from dead alphabets — Assyrian cuneiform — and so on, until finally you have a very long list of characters.
- What you have created — a numbered list of characters — is known as a character set.
- The numbers in the list — the numeric identifiers of the characters in the character set — are called code points.
- And because your list is meant to include every character that ever existed, you call your character set the Universal Character Set.
Congratulations! You’ve just invented (something similar to) the the first half of Unicode — the Universal Character Set or UCS.
Now suppose you jump into your time machine and zip forward to the present. Everybody is using computers. You have a brilliant idea. You will devise a way for computers to handle UCS.
You know that computers think in ones and zeros — bits — and collections of 8 bits — bytes. So you look at the biggest number in your UCS and ask yourself: How many bytes will I need to store a number that big? The answer you come up with is 4 bytes, 32 bits. So you decide on a simple and straight-forward digital implementation of UCS — each number will be stored in 4 bytes. That is, you choose a fixed-length encoding in which every UCS character (code point) can be represented, or encoded, in exactly 4 bytes, or 32 bits.
UTF-8 and variable-length encodings
UCS-4 is simple and straight-forward… but inefficient. Computers send a lot of strings back and forth, and many of those strings use only ASCII characters — characters from the old ASCII character set. One byte — eight bits — is more than enough to store such characters. It is grossly inefficient to use 4 bytes to store an ASCII character.
The key to the solution is to remember that a code point is nothing but a number (an integer). It may be a short number or a long number, but it is only a number. We need just one byte to store the shorter numbers of the Universal Character Set, and we need more bytes only when the numbers get longer. So the solution to our problem is a variable-length encoding.
Specifically, Unicode’s UTF-8 (Unicode Transformation Format, 8 bit) is a variable-length encoding in which each UCS code point is encoded using 1, 2, 3, or 4 bytes, as necessary.
In UTF-8, if the first bit of a byte is a “0″, then the remaining 7 bits of the byte contain one of the 128 original 7-bit ASCII characters. If the first bit of the byte is a “1″ then the byte is the first of multiple bytes used to represent the code point, and other bits of the byte carry other information, such as the total number of bytes — 2, or 3, or 4 bytes — that are being used to represent the code point. (For a quick overview of how this works at the bit level, see How does UTF-8 “variable-width encoding” work?)
Just use UTF-8
UTF-8 is a great technology, which is why it has become the de facto standard for encoding Unicode text, and is the most widely-used text encoding in the world. Text strings that use only ASCII characters can be encoded in UTF-8 using only one byte per character, which is very efficient. And if characters — Chinese or Japanese characters, for instance — require multiple bytes, well, UTF-8 can do that, too.
Byte Order Mark
Unicode fixed-length multi-byte encodings such as UTF-16 and UTF-32 store UCS code points (integers) in multi-byte chunks — 2-byte chunks in the case of UTF-16 and 4-byte chunks in the case of UTF-32.
Unfortunately, different computer architectures — basically, different processor chips — use different techniques for storing such multi-byte integers. In “little-endian” computers, the “little” (least significant) byte of a multi-byte integer is stored leftmost. “Big-endian” computers do the reverse; the “big” (most significant) byte is stored leftmost.
- Intel computers are little-endian.
- Motorola computers are big-endian.
- Microsoft Windows was designed around a little-endian architecture — it runs only on little-endian computers or computers running in little-endian mode — which is why Intel hardware and Microsoft software fit together like hand and glove.
Differences in endian-ness can create data-exchange issues between computers. Specifically, the possibility of differences in endian-ness means that if two computers need to exchange a string of text data, and that string is encoded in a Unicode fixed-length multi-byte encoding such as UTF-16 or UTF-32, the string should begin with a Byte Order Mark (or BOM) — a special character at the beginning of the string that indicates the endian-ness of the string.
Strings encoded in UTF-8 don’t require a BOM, so the BOM is basically a non-issue for programmers who use only UTF-8.
- Ned Batchelder’s Pragmatic Unicode. Highly recommended.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (2003) by Joel Spolsky is good, and widely read, but now a bit dated. I think it is rather misleading in the prominence it gives to the BOM.