Unicode – the basics

An introduction to the basics of Unicode, distilled from several earlier posts. In the interests of presenting the big picture, I have painted with a broad brush — large areas are summarized; nits are not picked; hairs are not split; wind resistance is ignored.

Unicode = one character set, plus several encodings

Unicode is actually not one thing, but two separate and distinct things. The first is a character set and the second is a set of encodings.

  • The first — the idea of a character set — has absolutely nothing to do with computers.
  • The second — the idea of encodings for the Unicode character set — has everything to do with computers.

Character sets

The idea of a character set has nothing to do with computers. So let’s suppose that you’re a British linguist living in, say, 1750. The British Empire is expanding and Europeans are discovering many new languages, both living and dead. You’ve known about Chinese characters for a long time, and you’ve just discovered Sumerian cuneiform characters from the Middle East and Sanskrit characters from India.

Trying to deal with this huge mass of different characters, you get a brilliant idea — you will make a numbered list of every character in every language that ever existed.

You start your list with your own familiar set of English characters — the upper- and lower-case letters, the numeric digits, and the various punctuation marks like period (full stop), comma, exclamation mark, and so on. And the space character, of course.

01 a
02 b
03 c
...
26 z
27 A
28 B
...
52 Z
53 0
54 1
55 2
...
62 9
63 (space)
64 ? (question mark)
65 , (comma)
... and so on ...

Then you add the Spanish, French and German characters with tildes, accents, and umlauts. You add characters from other living languages — Greek, Japanese, Chinese, Korean, Sanskrit, Arabic, Hebrew, and so on. You add characters from dead alphabets — Assyrian cuneiform — and so on, until finally you have a very long list of characters.

  • What you have created — a numbered list of characters — is known as a character set.
  • The numbers in the list — the numeric identifiers of the characters in the character set — are called code points.
  • And because your list is meant to include every character that ever existed, you call your character set the Universal Character Set.

Congratulations! You’ve just invented (something similar to) the first half of Unicode — the Universal Character Set or UCS.

Encodings

Now suppose you jump into your time machine and zip forward to the present. Everybody is using computers. You have a brilliant idea. You will devise a way for computers to handle UCS.

You know that computers think in ones and zeros — bits — and collections of 8 bits — bytes. So you look at the biggest number in your UCS and ask yourself: How many bytes will I need to store a number that big? The answer you come up with is 4 bytes, 32 bits. So you decide on a simple and straightforward digital implementation of UCS — each number will be stored in 4 bytes. That is, you choose a fixed-length encoding in which every UCS character (code point) can be represented, or encoded, in exactly 4 bytes, or 32 bits.

In short, you devise the Unicode UCS-4 (Universal Character Set, 4 bytes) encoding, aka UTF-32 (Unicode Transformation Format, 32 bits).
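
If you want to see this fixed-length idea at work, here is a minimal Python 3 sketch (my own addition, not part of the original scheme); the "utf-32-be" codec is used so that no byte order mark gets prepended:

    # A minimal sketch: in UTF-32, every code point becomes exactly 4 bytes.
    data = "Abc".encode("utf-32-be")   # big-endian variant: no byte order mark added
    print(len(data))    # 12 -- three characters, 4 bytes each
    print(data.hex())   # 000000410000006200000063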

UTF-8 and variable-length encodings

UCS-4 is simple and straightforward… but inefficient. Computers send a lot of strings back and forth, and many of those strings use only ASCII characters — characters from the old ASCII character set. One byte — eight bits — is more than enough to store such characters. It is grossly inefficient to use 4 bytes to store an ASCII character.

The key to the solution is to remember that a code point is nothing but a number (an integer). It may be a short number or a long number, but it is only a number. We need just one byte to store the shorter numbers of the Universal Character Set, and we need more bytes only when the numbers get longer. So the solution to our problem is a variable-length encoding.

Specifically, Unicode’s UTF-8 (Unicode Transformation Format, 8 bit) is a variable-length encoding in which each UCS code point is encoded using 1, 2, 3, or 4 bytes, as necessary.

In UTF-8, if the first bit of a byte is a “0”, then the remaining 7 bits of the byte contain one of the 128 original 7-bit ASCII characters. If a byte starts with the bits “11”, it is the first (lead) byte of a multi-byte sequence, and the number of leading 1 bits tells how many bytes — 2, or 3, or 4 bytes — are being used to represent the code point; the remaining bytes of the sequence are continuation bytes, each of which starts with the bits “10”. (For a quick overview of how this works at the bit level, see How does UTF-8 “variable-width encoding” work?)
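
To watch those leading bits in action, here is a small Python 3 sketch (my own illustration; the sample characters are arbitrary):

    # Sketch: the bit pattern of each UTF-8 byte for 1-, 2-, 3-, and 4-byte characters.
    for ch in ["A", "é", "€", "😀"]:
        encoded = ch.encode("utf-8")
        bits = " ".join(format(b, "08b") for b in encoded)
        print(ch, len(encoded), "byte(s):", bits)
    # "A" -> 01000001                     (leading 0: plain ASCII)
    # "é" -> 11000011 10101001            (leading 110: 2-byte sequence;
    #                                      continuation bytes start with 10)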

Just use UTF-8

UTF-8 is a great technology, which is why it has become the de facto standard for encoding Unicode text, and is the most widely-used text encoding in the world. Text strings that use only ASCII characters can be encoded in UTF-8 using only one byte per character, which is very efficient. And if characters — Chinese or Japanese characters, for instance — require multiple bytes, well, UTF-8 can do that, too.

Byte Order Mark

Unicode fixed-length multi-byte encodings such as UTF-16 and UTF-32 store UCS code points (integers) in multi-byte chunks — 2-byte chunks in the case of UTF-16 and 4-byte chunks in the case of UTF-32.

Unfortunately, different computer architectures — basically, different processor chips — use different techniques for storing such multi-byte integers. In “little-endian” computers, the “little” (least significant) byte of a multi-byte integer is stored leftmost. “Big-endian” computers do the reverse; the “big” (most significant) byte is stored leftmost.

  • Intel computers are little-endian.
  • Motorola computers are big-endian.
  • Microsoft Windows was designed around a little-endian architecture — it runs only on little-endian computers or computers running in little-endian mode — which is why Intel hardware and Microsoft software fit together like hand and glove.

Differences in endian-ness can create data-exchange issues between computers. Specifically, the possibility of differences in endian-ness means that if two computers need to exchange a string of text data, and that string is encoded in a Unicode fixed-length multi-byte encoding such as UTF-16 or UTF-32, the string should begin with a Byte Order Mark (or BOM) — a special character at the beginning of the string that indicates the endian-ness of the string.

Strings encoded in UTF-8 don’t require a BOM, so the BOM is basically a non-issue for programmers who use only UTF-8.
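
To make the point concrete, here is a short Python 3 sketch (my own addition) showing the BOM that the UTF-16 codec prepends, and its absence in UTF-8:

    # Sketch: Python's "utf-16" codec writes a BOM; "utf-8" does not.
    print("hi".encode("utf-16"))     # b'\xff\xfeh\x00i\x00' on a little-endian machine
    print("hi".encode("utf-16-be"))  # b'\x00h\x00i' -- byte order is explicit, no BOM
    print("hi".encode("utf-8"))      # b'hi' -- no BOM needed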



Unicode for dummies — Encoding

Another entry in an irregular series of posts about Unicode.
Typos fixed 2012-02-22. Thanks Anonymous, and Clinton, for reporting the typos.

This is a story about encoding and decoding, with a minor subplot involving Unicode.

As our story begins — on a dark and stormy night, of course — we find our protagonist deep in thought. He is asking himself “What is an encoding?”

What is an encoding?

The basic concepts are simple. First, we start with the idea of a piece of information — a message — that exists in a representation that is understandable (perspicuous) to a human being. I’m going to call that representation “plain text”. For English-language speakers, for example, English words printed on a page, or displayed on a screen, count as plain text.

Next, (for reasons that we won’t explore right now) we need to be able to translate a message in a plain-text representation into some other representation (let’s call that representation the “encoded text”), and we need to be able to translate the encoded text back into plain text. The translation from plain text to encoded text is called “encoding”, and the translation of encoded text back into plain text is called “decoding”.

(Diagram: a plain-text message is translated into encoded text by encoding, and translated back into plain text by decoding.)

There are three points worth noting about this process.

The first point is that no information can be lost during encoding or decoding. It must be possible for us to send a message on a round-trip journey — from plain text to encoded text, and then back again from encoded text to plain text — and get back exactly the same plain text that we started with. That is why, for instance, we can’t use one natural language (Russian, Chinese, French, Navaho) as an encoding for another natural language (English, Hindi, Swahili). The mappings between natural languages are too loose to guarantee that a piece of information can make the round-trip without losing something in translation.

The requirement for a lossless round-trip means that the mapping between the plain text and the encoded text must be very tight, very exact. And that brings us to the second point.

In order for the mapping between the plain text and the encoded text to be very tight — which is to say: in order for us to be able to specify very precisely how the encoding and decoding processes work — we must specify very precisely what the plain text representation looks like.

Suppose, for example, we say that plain text looks like this: the 26 upper-case letters of the Anglo-American alphabet, plus the space and three punctuation symbols: period (full stop), question mark, and dash (hyphen). This gives us a plain-text alphabet of 30 characters. If we need numbers, we can spell them out, like this: “SIX THOUSAND SEVEN HUNDRED FORTY-THREE”.

On the other hand, we may wish to say that our plain text looks like this: 26 upper-case letters, 26 lower-case letters, 10 numeric digits, the space character, and a dozen types of punctuation marks: period, comma, double-quote, left parenthesis, right parenthesis, and so on. That gives us a plain-text alphabet of 75 characters.

Once we’ve specified exactly what a plain-text representation of a message looks like — a finite sequence of characters from our 30-character alphabet, or perhaps our 75-character alphabet — then we can devise a system (a code) that can reliably encode and decode plain-text messages written in that alphabet. The simplest such system is one in which every character in the plain-text alphabet has one and only one corresponding representation in the encoded text. A familiar example is Morse code, in which “SOS” in plain text corresponds to

                ... --- ...

in encoded text.
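
The same one-to-one idea can be written out in a few lines of Python (a toy sketch of my own, covering just the two letters needed for “SOS”):

    # Toy sketch: a tiny one-to-one code, and the lossless round trip it allows.
    to_morse = {"S": "...", "O": "---"}
    from_morse = {code: letter for letter, code in to_morse.items()}

    encoded = " ".join(to_morse[letter] for letter in "SOS")
    decoded = "".join(from_morse[code] for code in encoded.split(" "))
    print(encoded)            # ... --- ...
    print(decoded == "SOS")   # True -- nothing lost on the round trip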

In the real world, of course, the selection of characters for the plain-text alphabet is influenced by technological limitations on the encoded text. Suppose we have several available technologies for storing encoded messages: one technology supports an encoded alphabet of 256 characters, another technology supports only 128 encoded characters, and a third technology supports only 64 encoded characters. Naturally, we can make our plain-text alphabet much larger if we know that we can use a technology that supports a larger encoded-text alphabet.

And the reverse is also true. If we know that our plain-text alphabet must be very large, then we know that we must find — or devise — a technology capable of storing a large number of encoded characters.

Which brings us to Unicode.

Unicode

Unicode was devised to be a system capable of storing encoded representations of every plain-text character of every human language that has ever existed. English, French, Spanish. Greek. Arabic. Hindi. Chinese. Assyrian (cuneiform characters).

That’s a lot of characters.

So the first task of the Unicode initiative was simply to list all of those characters, and count them. That’s the first half of Unicode, the Universal Character Set. (And if you really want to “talk Unicode”, don’t call plain-text characters “characters”. Call them “code points”.)

Once you’ve done that, you’ve got to figure out a technology for storing all of the corresponding encoded-text characters. (In Unicode-speak, the encoded-text units are called “code values”, or, in current versions of the standard, “code units”.)

In fact Unicode defines not one but several methods of mapping code points to code values. Each of these methods has its own name. Some of the names start with “UTF”, others start with “UCS”: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4, and so on. The naming convention is “UTF-<number of bits in a code value>” and “UCS-<number of bytes in a code value>”. Some of these methods (e.g. UCS-4 and UTF-32) are functionally equivalent. See the Wikipedia article on Unicode.

The most important thing about these methods is that some are fixed-width encodings and some are variable-width encodings. The basic idea is that the fixed-width encodings use very wide code values — UCS-4 and UTF-32 use 4 bytes (32 bits) per code point — wide enough to hold the biggest code value that we will ever need.

In contrast, the variable-width encodings are designed to be short, but expandable. UTF-8, for example, can use as few as 8 bits (one byte) to store the code points of ASCII characters. But it also has a sort of “continued on the next byte” mechanism that allows it to use 2, 3, or even 4 bytes if it needs to (as it might, for Chinese characters). For Western programmers, that means that UTF-8 is both efficient and flexible, which is why UTF-8 is the de facto standard encoding for exchanging Unicode text.

There is, then, no such thing as THE Unicode encoding system or method. There are several encoding methods, and if you want to exchange text with someone, you need explicitly to specify which encoding method you are using.

Is it, say, this.

(Diagram: plain text encoded and decoded using UTF-8.)

Or this.

(Diagram: plain text encoded and decoded using UTF-16.)

Or something else.
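
In Python terms, the difference is easy to demonstrate: the same plain text produces different bytes under different encodings, and only the matching codec turns the bytes back into text (a sketch of my own):

    # Sketch: one string, two different encoded representations.
    text = "Hi!"
    as_utf8 = text.encode("utf-8")       # b'Hi!'
    as_utf16 = text.encode("utf-16-be")  # b'\x00H\x00i\x00!'
    print(as_utf8 == as_utf16)                   # False -- different bytes
    print(as_utf16.decode("utf-16-be") == text)  # True  -- right codec, text restored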

Which brings us back to something I said earlier.

Why encode something in Unicode?

At the beginning of this post I said

We start with the idea of a piece of information — a message — that exists in a representation that is understandable (perspicuous) to a human being.

Next, (for reasons that we won’t explore right now) we need to be able to translate a message in a plain-text representation into some other representation. The translation from plain text to encoded text is called “encoding”, and the translation of encoded text back into plain text is called “decoding”.

OK. So now it is time to explore those reasons. Why might we want to translate a message in a plain-text representation into some other representation?

One reason, of course, is that we want to keep a secret. We want to hide the plain text of our message by encrypting and decrypting it — basically, by keeping the algorithms for encoding and decoding secret and private.

But that is a completely different subject. Right now, we’re not interested in keeping secrets; we’re Python programmers and we’re interested in Unicode. So:

Why — as a Python programmer — would I need to be able to translate a plain-text message into some encoded representation… say, a Unicode representation such as UTF-8?

Suppose you are happily sitting at your PC, working with your favorite text editor, writing the standard Hello World program in Python (specifically, in Python 3+). This single line is your entire program.

                   print("Hello, world!")

Here, “Hello, world!” is plain text. You can see it on your screen. You can read it. You know what it means. It is just a string and you can (if you wish) do standard string-type operations on it, such as taking a substring (a slice).

But now suppose you want to put this string — “Hello, world!” — into a file and save the file on your hard drive. Perhaps you plan to send the file to a friend.

That means that you must eject your poor little string from the warm, friendly, protected home in your Python program, where it exists simply as plain-text characters. You must thrust it into the cold, impersonal, outside world of the file system. And out there it will exist not as characters, but as mere 1’s and 0’s, a jumble of dits and dots, charged and uncharged particles. And that means that your happy little plain-text string must be represented by some specific configuration of 1s and 0s, so that when somebody wants to retrieve that collection of 1s and 0s and convert it back into readable plain text, they can.

The process of converting a plain text into a specific configuration of 1s and 0s is a process of encoding. In order to write a string to a file, you must encode it using some encoding system (such as UTF-8). And to get it back from a file, you must read the file and decode the collection of 1s and 0s back into plain text.

The need to encode/decode strings when writing them to, or reading them from, files isn’t something new — it is not an additional burden imposed by Python 3’s new support for Unicode. It is something you have always done. But it wasn’t always so obvious. In earlier versions of Python, the default encoding scheme was ASCII. And because, in those olden times, ASCII was pretty much the only game in town, you didn’t need to specify that you wanted to write and read your files in ASCII. Python just assumed it by default and did it. But — whether or not you realized it — whenever one of your programs wrote strings to, or read strings from, a file, Python was busy behind the scenes, doing the encoding and decoding for you.

So that’s why you — as a Python programmer — need to be able to encode and decode text into, and out of, UTF-8 (or some other encoding: UTF-16, ASCII, whatever). You need to encode your strings as 1s and 0s so you can put those 1s and 0s into a file and send the file to someone else.
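
Here is a minimal Python 3 sketch of that round trip (the file name is just a placeholder of my own):

    # Sketch: encoding on the way out, decoding on the way back in.
    with open("hello.txt", "w", encoding="utf-8") as f:
        f.write("Hello, world!")   # the string is encoded into UTF-8 bytes

    with open("hello.txt", "r", encoding="utf-8") as f:
        print(f.read())            # the bytes are decoded back into a string

    # Or, doing the two steps by hand:
    data = "Hello, world!".encode("utf-8")   # str -> bytes
    text = data.decode("utf-8")              # bytes -> str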

What is plain text?

Earlier, I said that there were three points worth noting about the encoding/decoding process, and I discussed the first two. Here is the third point.

The distinction between plain text and encoded text is relative and context-dependent.

As programmers, we think of plain text as being written text. But it is possible to look at matters differently. For instance, we can think of spoken text as the plain text, and written text as the encoded text. From this perspective, writing is encoded speech. And there are many different encodings for speech as writing. Think of Egyptian hieroglyphics, Mayan hieroglyphics, the Latin alphabet, the Greek alphabet, Arabic, Chinese ideograms, wonderfully flowing Devanagari देवनागरी, sharp pointy cuneiform wedges, even shorthand. These are all written encodings for the spoken word. They are all, as Thomas Hobbes put it, “Marks by which we may remember our thoughts”.

Which reminds us that, in a different context, even speech itself — language — may be regarded as a form of encoding. In much of early modern philosophy (think of Hobbes and Locke) speech (or language) was basically considered to be an encoding of thoughts and ideas. Communication happens when I encode my thought into language and say something — speak to you. You hear the sound of my words and decode it back into ideas. We achieve communication when I successfully transmit a thought from my mind to your mind via language. You understand me when — as a result of my speech — you have the same idea in your mind as I have in mine. (See Ian Hacking, Why Does Language Matter to Philosophy?)

Finally, note that in other contexts, the “plain text” isn’t even text. Where the plain text is soundwaves (e.g. music), it can be encoded as an mp3 file. Where the plain text is an image, it can be encoded as a gif, or png, or jpg file. Where the plain text is a movie, it can be encoded as a wmv file. And so on.

Everywhere, we are surrounded by encoding and decoding.


Notes

I’d like to recommend Eli Bendersky’s recent post on The bytes/str dichotomy in Python 3, which prodded me — finally — to put these thoughts into writing. I especially like this passage in his post.

Think of it this way: a string is an abstract representation of text. A string consists of characters, which are also abstract entities not tied to any particular binary representation. When manipulating strings, we’re living in blissful ignorance. We can split and slice them, concatenate and search inside them. We don’t care how they are represented internally and how many bytes it takes to hold each character in them. We only start caring about this when encoding strings into bytes (for example, in order to send them over a communication channel), or decoding strings from bytes (for the other direction).

I strongly recommend Charles Petzold’s wonderful book Code: The Hidden Language of Computer Hardware and Software.

And finally, I’ve found Stephen Pincock’s Codebreaker: The History of Secret Communications a delightful read. It will tell you, among many other things, how the famous WWII Navaho codetalkers could talk about submarines and dive bombers… despite the fact that there are no Navaho words for “submarine” or “dive bomber”.

Unicode for dummies – just use UTF-8

Revised 2012-03-18 — fixed a bad link, and removed an incorrect statement about the origin of the terms “big-endian” and “little-endian”.

Commenting on my previous post about Unicode, an anonymous commentator noted that

the usage of the BOM [the Unicode Byte Order Mark] with UTF-8 is strongly discouraged and really only a Microsoft-ism. It’s not used on Linux or Macs and just tends to get in the way of things.

So it seems worthwhile to talk a bit more about the BOM.  And in the spirit of Beginners Introduction for Dummies Made Simple, let’s begin at the beginning: by distinguishing big and little from left and right.

Big and Little

“Big” in this context means “most significant”. “Little” means “least significant”.

Consider the year of American independence — 1776.  In the number 1776:

  • The least significant (“smallest”) digit is 6. It has the smallest magnitude: it represents 6 * 1, or 6.
  • The most significant (“biggest”) digit is 1. It has the largest magnitude: it represents 1 * 1000, or 1000.

So we say that 1 is located at the big end of 1776 and 6 is located at the small end of 1776.

Left and Right

The terms big-endian and little-endian derive from Jonathan Swift's satirical novel 'Gulliver’s Travels' by way of Danny Cohen in 1980.

Here are two technical terms: “big endian” and “little endian”.

These terms are derived from “Big End In” and “Little End In.”  According to Wikipedia, the terms Little-Endian and Big-Endian were introduced in 1980 by Danny Cohen in a paper called “On Holy Wars and a Plea for Peace”.

1776 is a “big endian” number because the “biggest” (most significant) digit is stored in the leftmost position. The big end of 1776 is on the left.

Big-endian numbers are familiar.  Our everyday “Arabic” numerals are big-endian representations of numbers.  If we used a little-endian representation, the number 1776 would be represented as 6771.  That is, with the “little” end of 1776 — the “smallest” (least significant) digit — in the leftmost position.

What do you think? In Roman numerals, 1776 is represented as MDCCLXXVI. Are Roman numerals big-endian or little-endian?

So big and little are not the same as left and right.

Byte Order

Now we’re ready to talk about byte order. And specifically, byte-order in computer architectures.

Most computer (hardware) architectures agree on bits (ON and OFF) and bytes (a sequence of 8 bits), and byte-level endian-ness.  (Bytes are big-endian: the leftmost bit of a byte is the biggest.  See Understanding Big and Little Endian Byte Order.)

But problems come up when handling pieces of data, like large numbers and strings, that are stored in multiple bytes.  Different computer architectures use different endian-ness at the level of multi-byte data items (I’ll call them chunks of data).

In the memory of little-endian computers, the “little” end of a data chunk is stored leftmost. This means that a data chunk whose logical value is 0x12345678 is stored as 4 bytes with the least significant byte to the left, like this: 0x78 0x56 0x34 0x12.

  • For those (like me) who are still operating largely at the dummies level: imagine 1776 being stored in memory as 6771.

Big-endian hardware does the reverse. In the memory of big-endian computers, the “big” end of a data chunk is stored leftmost. This means that a data chunk of 0x12345678 is stored as 4 bytes with the most significant byte to the left, like this: 0x12 0x34 0x56 0x78.

  • For us dummies: imagine 1776 being stored in memory as 1776.
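
Python’s standard struct module makes the two layouts easy to compare (a sketch of my own, reusing the example value 0x12345678):

    import struct

    # Sketch: the same 4-byte integer, packed little-endian and big-endian.
    print(struct.pack("<I", 0x12345678).hex())   # 78563412 -- little end first
    print(struct.pack(">I", 0x12345678).hex())   # 12345678 -- big end first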

Here are some random (but curiously interesting) bits of information, courtesy of the Microsoft Support web-site article Explanation of Big Endian and Little Endian Architecture.

  • Intel computers are little endian.
  • Motorola computers are big endian.
  • RISC-based MIPS computers and the DEC Alpha computers are configurable for big endian or little endian.
  • Windows NT was designed around a little endian architecture, and runs only on little-endian computers or computers running in little-endian mode.

In summary, the byte order — the order of the bytes in multi-byte chunks of data — is different on big-endian and little-endian computers.

Which brings us to…

The Unicode Byte Order Mark

In this section, I’m going shamelessly to rip off information from Jukka K. Korpela’s outstanding Unicode Explained from O’Reilly (see the section on Byte Order starting on page 300). (See also Jukka’s valuable web page on characters and encodings.)

Suppose you’re running a big-endian computer, and create a file in Unicode’s UTF-16 (two-byte) format.

Note that the encoding is the Unicode UTF-16 (two-byte) encoding, not UTF-8 (one-byte). That’s an important aspect of the problem, as you will see.

You send the file out into the world, and it is downloaded by somebody running a little-endian computer. The recipient knows that the file is in UTF-16 encoding. But the bytes are not in the order that he (with his little-endian computer) expects. The data in the file appears to be scrambled beyond recognition.

The solution, of course, is simply to tell the recipient that the file was encoded in UTF-16 on a big-endian computer.  Ideally, we’d like the data in the file itself to be able to tell the recipient the byte order (big endian or little endian) that was used when the data was encoded and stored in the file.

This is exactly what the Unicode byte order mark (BOM) is designed to do.

Unicode reserves one code point, U+FEFF, specifically to serve as the byte order mark. Its byte-swapped twin, U+FFFE, is deliberately defined as a noncharacter, guaranteed never to be assigned to a real character.

That guarantee is what makes the BOM work. If the first two bytes of a file are 0xFE 0xFF, a Unicode decoder knows that it is looking at a big-endian BOM; if they are 0xFF 0xFE, it knows that the data is little-endian. Either way, it knows that those two bytes contain a Unicode BOM, and knows what to do with the BOM.

This also means that if you (in the role, say, of a forensic computer scientist) must process a mystery file, and you see that the file’s first two bytes contain one of the two Unicode BOMs, you can (with a high probability of being correct) infer that the file is encoded in Unicode UTF-16 format.
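
Python’s codecs module exposes the BOM byte sequences as constants, so that forensic check can be written in a few lines (a sketch of my own; it only looks for the two UTF-16 BOMs discussed here):

    import codecs

    # Sketch: guess the byte order of a UTF-16 file from its first two bytes.
    def sniff_utf16_byte_order(first_two_bytes):
        if first_two_bytes == codecs.BOM_UTF16_BE:   # b'\xfe\xff'
            return "utf-16-be"
        if first_two_bytes == codecs.BOM_UTF16_LE:   # b'\xff\xfe'
            return "utf-16-le"
        return None   # no BOM: these two bytes alone can't tell us

    print(sniff_utf16_byte_order(b"\xff\xfe"))   # utf-16-le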

So: Where’s the BOM?

In actual practice, most UTF-8 files do not include a BOM.  Why not?

A file that has been encoded using UTF-16 is an ordered sequence of 2-byte chunks. Knowing the order of the bytes within the chunks is crucial to being able to decode the file into the correct Unicode code points.  So a BOM should be considered mandatory for files encoded using UTF-16.

But a file in UTF-8 encoding is an ordered sequence of 1-byte chunks.  In UTF-8, a byte and a chunk are essentially the same thing.  So with UTF-8, the problem of knowing the order of the bytes within the chunks is simply a non-issue, and a BOM is pointless. And since the Unicode standard does not require the use of the BOM, virtually nobody puts a BOM in files encoded using UTF-8.
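
For the rare cases, mostly involving Windows tools, where a UTF-8 BOM is wanted or must be stripped, Python does offer a separate “utf-8-sig” codec; the plain “utf-8” codec never writes one. A small sketch of my own:

    # Sketch: plain "utf-8" writes no BOM; "utf-8-sig" prepends one,
    # and strips it again when decoding.
    print("hi".encode("utf-8"))        # b'hi'
    print("hi".encode("utf-8-sig"))    # b'\xef\xbb\xbfhi'
    print(b"\xef\xbb\xbfhi".decode("utf-8-sig"))   # hi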

Let’s do UTF-8… all the time!

It is important to recognize that UTF-8 is able to represent any character in the Unicode standard.  So there is a simple rule for coding English text (i.e. text that uses only or mostly ASCII characters) —

Always use UTF-8.

  • UTF-8 is easy to use. You don’t need a BOM.
  • UTF-8 can encode anything.
  • For English or mostly-ASCII text, there is essentially no storage penalty for using UTF-8. (Note, however, that if you’re encoding Chinese text, your mileage will differ!)

What’s not to like!!??

UTF-8? For every Unicode code point?!

How can you possibly encode every character in the entire Unicode character set using only 8 bits!!!!

Here’s where Joel Spolsky’s (Joel on Software) excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) comes in useful.  As Joel notes

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

This is the myth that Unicode is what is known as a Multibyte Character Set (MBCS) or Double-Byte Character Set (DBCS).   Hopefully, by now, this myth is dying.

In fact, UTF-8 is what is known variously as a

  • multibyte encoding
  • variable-width encoding
  • multi-octet encoding (For us dummies, octet == byte. For the difference, see page 46 of Korpela’s Unicode Explained.)

Here’s how multibyte encoding works in UTF-8.

  • ASCII characters are stored in single bytes.
  • Non-ASCII characters are stored in multiple bytes, in a “multibyte sequence”.
  • For non-ASCII characters, the first byte in a multibyte sequence is always in the range 0xC0 to 0xFD (in practice, with today’s four-byte limit, 0xC2 to 0xF4). The coding of the first byte indicates how many bytes follow, and so indicates the total number of bytes in the multibyte sequence. The following bytes (the continuation bytes) are always in the range 0x80 to 0xBF.
  • In UTF-8, a multibyte sequence can contain as many as four bytes.
  • Originally a multibyte sequence could contain six bytes, but UTF-8 was restricted to four bytes by RFC 3629 in November 2003.

For a quick overview of how this works at the bit level, take a look at the answer by dsimard to the question How does UTF-8 “variable-width encoding” work? on stackoverflow.
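
Those rules are easy to check from Python (a sketch of my own; the sample characters are arbitrary examples of 1-, 2-, 3-, and 4-byte sequences):

    # Sketch: byte counts and lead bytes for a few sample characters.
    for ch in ["A", "é", "中", "😀"]:
        seq = ch.encode("utf-8")
        print(ch, len(seq), "byte(s), lead byte", hex(seq[0]))
    # A  -> 1 byte(s), lead byte 0x41   (plain ASCII)
    # é  -> 2 byte(s), lead byte 0xc3   (a lead byte in the multibyte range)
    # 中 -> 3 byte(s), lead byte 0xe4
    # 😀 -> 4 byte(s), lead byte 0xf0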

Wrapping it all up

So that’s it. Our investigation of the BOM has led us to take a closer look at UTF-8 and multibyte encoding.

And that leads us to a nice place. For the most part, and certainly if you’re working with ASCII data, there is a simple rule.

Just use UTF-8 and forget about the BOM.

Unicode Beginners Introduction for Dummies Made Simple

I’ve been trying to grok Unicode, and it hasn’t been easy.  But today, I finally got it.  And, as it turns out, the basics of Unicode aren’t too difficult.

The problems that I’ve been having turn out not to be with Unicode, but with the introductions that I’ve found.  They’re pretty confusing.  Or maybe I should say, they just don’t fit my brain.   So the logical thing to do, of course, is to write the introduction that I would like to have read.

A lot of what I will write will be shamelessly ripped off from other descriptions that I’ve found useful.

We start with the observation that “Unicode” is actually two separate and distinct things.  And the first of these things has nothing to do with computers.

Suppose you’re an English orientalist in, say, 1750.  You’ve just discovered Sumerian cuneiform characters from the Middle East and Sanskrit characters from India.  You get a brilliant idea.  You will make a list of all characters in all languages ever used.  Each will be identified by its own unique number.  So you start out making your list with your own good English characters.  You add in the cuneiform characters and the Sanskrit characters and Greek, Japanese, Chinese, and Korean characters. You add in the funny squiggly/accented/umlauted characters from Spanish, French and German. And so on. And finally you have a very long list of about a zillion characters.

1 a
2 b
3 c
...
26 z
27 A
28 B
...
52 Z
53 (space)
54 ? (question mark)
55 , (comma)
... and so on ...

And (as I say) you did it all with your feather-quill pen. This has nothing to do with computers. It is simply about creating a numbered list of all known characters.

When you finish, you have a complete (you hope) set of characters. So you call it a “character set”. And because you’re in a funny mood, instead of calling the numeric identifiers “numeric identifiers”, you call them “code points”. And because your list is meant to include every character in the known universe, you call it the Universal Character Set, or UCS.

Congratulations! You’ve just invented the first, non-computer, half of Unicode, the Universal Character Set.

Now you borrow Guido’s time machine and fast-forward 260 years to 2010.  Everybody is using computers.  So you have a brilliant idea.  You will find a way for computers to handle your UCS.

Now computers think in 8-bit bytes.  So you think:  we’ll use one byte for each numeric identifier (code point)!  Great idea!  An 8-bit encoding.

The problem of course is that with 8 bits you can make only 256 different bit combinations.  And your list has way more than 256 characters.  So you think: we’ll use two bytes for each character!  Great idea!  A 16-bit encoding.

But there are still problems.  First, even two bytes are not enough to store a number as big as a zillion.  You figure that you’ll need at least 3 bytes to hold the biggest number on your list.  Second, even if you decided to use four bytes (32 bits) for each character, your list might still keep growing and someday even 32-bits might not be enough.  Third, you’re doing mostly English, and 8 bits is plenty for working with English.  So with a 16-bit encoding, you’re using twice as much storage as you really need (and, if you use a 32-bit encoding, you’re using four times as much as you need).

So you think:  Let’s just use an 8-bit encoding, but with a twist.  One of the bit combinations won’t identify a character at all, but will be sort of a continuation sign, saying (in essence) this character identifier is continued on the next several bytes.  So for the most part, you’ll use only one byte per character, but if you need a document to contain some exotic characters, you can do that.

Congratulations!  You’ve just invented UTF-8 — the 8-bit Unicode Transformation Format, a variable length encoding in which every UCS character (code point) can be encoded in 1 to 4 bytes.

Now you still have one last problem.  You’ve defined both a UTF-8 format and a UTF-16 format.  So you go to open a file and start reading.  You read the first two bytes.  How do you know what you’re reading?  Are the first two bytes two characters in UTF-8 encoding? or a single character in UTF-16 encoding?  What you need is a standard marker at the beginning of files to indicate what encoding the file is in.

Bingo.  You’ve just invented the Byte Order Mark, or BOM (aka “encoding signature”).  The BOM is a short marker at the beginning of a file (two bytes in UTF-16, three in UTF-8, four in UTF-32) that indicates the byte order of the data and, used as an encoding signature, which encoding the file is using.

So now, when you read a file, you first read the BOM, which tells you what encoding was used to create the file.  This allows you to decode the file into code points (however code points are represented internally in your programming language: Java, Python, whatever).  And when you write out a file, you choose the encoding to be used to encode your Unicode characters in bits: you write the BOM, and then you write out your Unicode strings, specifying that encoding when writing the bits and bytes to the file.
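
A rough Python sketch of that read-the-marker-then-decode idea (my own illustration; real code would handle more encodings, and the no-marker case, more carefully):

    import codecs

    # Sketch: pick a codec by looking at the opening bytes of the data.
    def decode_marked_bytes(raw):
        if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
            return raw.decode("utf-16")      # the utf-16 codec consumes the BOM itself
        if raw.startswith(codecs.BOM_UTF8):
            return raw.decode("utf-8-sig")   # strips the UTF-8 signature
        return raw.decode("utf-8")           # otherwise, assume plain UTF-8

    print(decode_marked_bytes("hello".encode("utf-16")))   # hello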

And that’s the basics.    In summary,

Unicode =
UCS (definition of a universal character set)
+
UTF (techniques for encoding code points in bit-configurations)

The connection between characters and bit configurations is the numeric character identifier, the “code point”.

     Character set
|----------------------|
                          Encoding
              |------------------------------|
CHARACTER     CODE POINT     BIT CONFIGURATION
=========     ==========     =================
a             1              0000
b             2              0001
and so on.
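
In Python, the two halves of that table map onto two pairs of built-ins (a closing sketch of my own): ord() and chr() move between characters and code points, while encode() and decode() move between code points and bit configurations. Of course, in real Unicode the code point for “a” is 97 (U+0061), not the 1 used in the toy table above.

    # Sketch: character <-> code point <-> bit configuration, in Python terms.
    print(ord("a"))                  # 97    -- the code point
    print(hex(ord("a")))             # 0x61
    print("a".encode("utf-8"))       # b'a'  -- the bit configuration (one byte, 0x61)
    print(chr(97))                   # a     -- from code point back to character
    print(b"\x61".decode("utf-8"))   # a     -- from bits back to character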

There are lots more complicated details of course. But this is the basics.

For a follow-up post, see Unicode for dummies – just use UTF-8.