Unicode – the basics

An introduction to the basics of Unicode, distilled from several earlier posts. In the interests of presenting the big picture, I have painted with a broad brush — large areas are summarized; nits are not picked; hairs are not split; wind resistance is ignored.

Unicode = one character set, plus several encodings

Unicode is actually not one thing, but two separate and distinct things. The first is a character set and the second is a set of encodings.

  • The first — the idea of a character set — has absolutely nothing to do with computers.
  • The second — the idea of encodings for the Unicode character set — has everything to do with computers.

Character sets

The idea of a character set has nothing to do with computers. So let’s suppose that you’re a British linguist living in, say, 1750. The British Empire is expanding and Europeans are discovering many new languages, both living and dead. You’ve known about Chinese characters for a long time, and you’ve just discovered Sumerian cuneiform characters from the Middle East and Sanskrit characters from India.

Trying to deal with this huge mass of different characters, you get a brilliant idea — you will make a numbered list of every character in every language that ever existed.

You start your list with your own familiar set of English characters — the upper- and lower-case letters, the numeric digits, and the various punctuation marks like period (full stop), comma, exclamation mark, and so on. And the space character, of course.

01 a
02 b
03 c
...
26 z
27 A
28 B
...
52 Z
53 0
54 1
55 2
...
62 9
63 (space)
64 ? (question mark)
65 , (comma)
... and so on ...
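This toy list is easy to model in code. Here is a minimal sketch in Python — note that the numbering mirrors the hypothetical list above, not real Unicode code points:

```python
import string

# A toy "character set": a numbered list of characters, stored as a dict
# mapping each number to its character. The numbering follows the
# hypothetical list above (1 = 'a', 27 = 'A', 53 = '0', 63 = space).
charset = {
    number: char
    for number, char in enumerate(
        string.ascii_lowercase + string.ascii_uppercase + string.digits + " ?,",
        start=1,
    )
}

print(charset[1])    # a
print(charset[27])   # A
print(charset[53])   # 0
print(charset[63])   # (space)
```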

Then you add the Spanish, French and German characters with tildes, accents, and umlauts. You add characters from other living languages — Greek, Japanese, Chinese, Korean, Sanskrit, Arabic, Hebrew, and so on. You add characters from dead alphabets — Assyrian cuneiform — and so on, until finally you have a very long list of characters.

  • What you have created — a numbered list of characters — is known as a character set.
  • The numbers in the list — the numeric identifiers of the characters in the character set — are called code points.
  • And because your list is meant to include every character that ever existed, you call your character set the Universal Character Set.

Congratulations! You've just invented (something similar to) the first half of Unicode — the Universal Character Set or UCS.
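In real Unicode, Python exposes this numbered list directly: `ord()` gives a character's code point, and `chr()` goes the other way.

```python
# ord() maps a character to its Unicode code point; chr() is the inverse.
print(ord("A"))        # 65 (U+0041)
print(hex(ord("é")))   # 0xe9 (U+00E9)
print(chr(0x1F600))    # 😀 -- code points run well past one byte
```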

Encodings

Now suppose you jump into your time machine and zip forward to the present. Everybody is using computers. You have a brilliant idea. You will devise a way for computers to handle UCS.

You know that computers think in ones and zeros — bits — and collections of 8 bits — bytes. So you look at the biggest number in your UCS and ask yourself: How many bytes will I need to store a number that big? The answer you come up with is 4 bytes, 32 bits. So you decide on a simple and straightforward digital implementation of UCS — each number will be stored in 4 bytes. That is, you choose a fixed-length encoding in which every UCS character (code point) can be represented, or encoded, in exactly 4 bytes, or 32 bits.

In short, you devise the Unicode UCS-4 (Universal Character Set, 4 bytes) encoding, aka UTF-32 (Unicode Transformation Format, 32 bits).
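A quick sketch in Python shows the fixed-length property: every character, however small its code point, occupies exactly 4 bytes in UTF-32 (big-endian here, so the bytes read in the same order as the number).

```python
# Each code point becomes exactly 4 bytes in UTF-32.
for ch in "Aé😀":
    encoded = ch.encode("utf-32-be")
    print(ch, len(encoded), encoded.hex())
# A 4 00000041
# é 4 000000e9
# 😀 4 0001f600
```

Note that the 4 bytes for "A" are mostly zeros — which is exactly the inefficiency the next section addresses.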

UTF-8 and variable-length encodings

UCS-4 is simple and straightforward... but inefficient. Computers send a lot of strings back and forth, and many of those strings use only ASCII characters — characters from the old ASCII character set. One byte — eight bits — is more than enough to store such characters. It is grossly inefficient to use 4 bytes to store an ASCII character.

The key to the solution is to remember that a code point is nothing but a number (an integer). It may be a short number or a long number, but it is only a number. We need just one byte to store the shorter numbers of the Universal Character Set, and we need more bytes only when the numbers get longer. So the solution to our problem is a variable-length encoding.

Specifically, Unicode's UTF-8 (Unicode Transformation Format, 8 bit) is a variable-length encoding in which each UCS code point is encoded using 1, 2, 3, or 4 bytes, as necessary.

In UTF-8, if the first bit of a byte is a "0", then the remaining 7 bits of the byte contain one of the 128 original 7-bit ASCII characters. If a byte begins with "11", it is the first (lead) byte of a multi-byte sequence, and the number of leading "1" bits tells you how many bytes — 2, 3, or 4 — make up the sequence. The remaining bytes of the sequence each begin with "10" and carry the rest of the code point's bits. (For a quick overview of how this works at the bit level, see How does UTF-8 "variable-width encoding" work?)
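You can see this directly in Python by encoding characters with increasingly large code points and looking at the bit patterns of the resulting bytes:

```python
# UTF-8 uses 1-4 bytes per character. The first byte's leading bits
# announce the sequence length; continuation bytes start with "10".
for ch in "aé中🙂":
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(ch, len(encoded), bits)
# a 1 01100001
# é 2 11000011 10101001
# 中 3 11100100 10111000 10101101
# 🙂 4 11110000 10011111 10011001 10000010
```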

Just use UTF-8

UTF-8 is a great technology, which is why it has become the de facto standard for encoding Unicode text, and is the most widely-used text encoding in the world. Text strings that use only ASCII characters can be encoded in UTF-8 using only one byte per character, which is very efficient. And if characters — Chinese or Japanese characters, for instance — require multiple bytes, well, UTF-8 can do that, too.

Byte Order Mark

Unicode multi-byte encodings such as UTF-16 and UTF-32 store UCS code points (integers) in multi-byte chunks — 2-byte chunks in the case of UTF-16 (with a pair of chunks for code points beyond U+FFFF) and 4-byte chunks in the case of UTF-32.

Unfortunately, different computer architectures — basically, different processor chips — use different techniques for storing such multi-byte integers. In "little-endian" computers, the "little" (least significant) byte of a multi-byte integer is stored leftmost. "Big-endian" computers do the reverse; the "big" (most significant) byte is stored leftmost.

  • Intel computers are little-endian.
  • Motorola computers are big-endian.
  • Microsoft Windows was designed around a little-endian architecture — it runs only on little-endian computers or computers running in little-endian mode — which is why Intel hardware and Microsoft software fit together like hand and glove.
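The same integer therefore comes out as a different byte sequence depending on endianness — easy to demonstrate with Python's `struct` module:

```python
import struct

# Pack the integer 0x41 ('A') as a 4-byte unsigned int in each byte order.
n = 0x41
print(struct.pack("<I", n).hex())  # little-endian: 41000000
print(struct.pack(">I", n).hex())  # big-endian:    00000041
```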

Differences in endian-ness can create data-exchange issues between computers. If two computers need to exchange a string of text data, and that string is encoded in a multi-byte encoding such as UTF-16 or UTF-32, the string should begin with a Byte Order Mark (or BOM) — a special character at the start of the string that indicates the endian-ness of the string.
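Here is a short sketch of the BOM at work, using the BOM constants from Python's `codecs` module: the generic "utf-16" decoder reads the BOM and picks the right byte order automatically.

```python
import codecs

text = "hi"
# Prepend the appropriate BOM to each byte order's encoding.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # ff fe 68 00 69 00
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")     # fe ff 00 68 00 69

# The generic decoder inspects the BOM and decodes both correctly.
print(little.decode("utf-16"))  # hi
print(big.decode("utf-16"))     # hi
```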

Strings encoded in UTF-8 don't require a BOM, so the BOM is basically a non-issue for programmers who use only UTF-8.




3 thoughts on “Unicode – the basics”

  1. I don’t know what a “leftmost” address is. I asked my computer, it didn’t know either.

    • A reasonable question! :-)

I discussed the BOM more extensively in an earlier post. In that post I noted that the terms “big endian” and “little endian” were introduced in 1980 by Danny Cohen in a paper called “On Holy Wars and a Plea for Peace”. I believe that paper is also the source for the technical use of the terms “left” and “right” when discussing computer architecture.

      Here are a few passages from that paper.

      Computer memory may be viewed as a linear sequence of bits, divided into bytes, words, pages and so on. Each unit is a subunit of the next level. This is, obviously, a hierarchical organization.

      <snip>

      We would like to illustrate the hierarchically consistent order graphically, but first we have to decide about the order in which computer words are written on paper. Do they go from left to right, or from right to left?

      The English language, like most modern languages, suggests that we lay these computer words on paper from left to right, like this:

                       |---word0---|---word1---|---word2---|....
      

      <snip>

      If we also use the traditional convention, as introduced by our numbering system, the wide-end is on the left and the narrow-end is on the right.

      <snip>

      Note that there is absolutely no specific significance whatsoever to the notion of "left" and "right" in bit order in a computer memory. One could think about it as "up" and "down" for example, or mirror it by systematically interchanging all the "left"s and "right"s. However, this notion stems from the concept that computer words represent numbers, and from the old mathematical tradition that the wide-end of a number (aka the MSB) is called "left" and the narrow-end of a number is called "right".

      This mathematical convention is the point of reference for the notion of "left" and "right".

      It is a fun paper, and worth a quick look. I'm certainly no expert in that field, but I enjoyed skimming it.
