Unicode for dummies – just use UTF-8

Revised 2012-03-18 — fixed a bad link, and removed an incorrect statement about the origin of the terms “big-endian” and “little-endian”.

Commenting on my previous post about Unicode, an anonymous commentator noted that

the usage of the BOM [the Unicode Byte Order Mark] with UTF-8 is strongly discouraged and really only a Microsoft-ism. It’s not used on Linux or Macs and just tends to get in the way of things.

So it seems worthwhile to talk a bit more about the BOM. And in the spirit of Beginners Introduction for Dummies Made Simple, let’s begin at the beginning: by distinguishing big and little from left and right.

Big and Little

“Big” in this context means “most significant”. “Little” means “least significant”.

Consider the year of American independence — 1776.  In the number 1776:

  • The least significant (“smallest”) digit is 6. It has the smallest magnitude: it represents 6 * 1, or 6.
  • The most significant (“biggest”) digit is 1. It has the largest magnitude: it represents 1 * 1000, or 1000.

So we say that 1 is located at the big end of 1776 and 6 is located at the little end of 1776.

Left and Right

The terms big-endian and little-endian derive from Jonathan Swift's satirical novel 'Gulliver’s Travels' by way of Danny Cohen in 1980.

Here are two technical terms: “big endian” and “little endian”.

According to Wikipedia, the terms Little-Endian and Big-Endian were introduced in 1980 by Danny Cohen in a paper called “On Holy Wars and a Plea for Peace”.

1776 is a “big endian” number because the “biggest” (most significant) digit is stored in the leftmost position. The big end of 1776 is on the left.

Big-endian numbers are familiar. Our everyday “Arabic” numerals are big-endian representations of numbers. If we used a little-endian representation, the number 1776 would be represented as 6771. That is, with the “little” end of 1776 (the “smallest”, least significant digit) in the leftmost position.

What do you think? In Roman numerals, 1776 is represented as MDCCLXXVI. Are Roman numerals big-endian or little-endian?

So big and little are not the same as left and right.

Byte Order

Now we’re ready to talk about byte order. And specifically, byte-order in computer architectures.

Most computer (hardware) architectures agree on bits (ON and OFF) and bytes (a sequence of 8 bits), and byte-level endian-ness.  (Bytes are big-endian: the leftmost bit of a byte is the biggest.  See Understanding Big and Little Endian Byte Order.)

But problems come up when handling pieces of data, like large numbers and strings, that are stored in multiple bytes.  Different computer architectures use different endian-ness at the level of multi-byte data items (I’ll call them chunks of data).

In the memory of little-endian computers, the “little” end of a data chunk is stored leftmost. This means that a data chunk whose logical value is 0x12345678 is stored as 4 bytes with the least significant byte to the left, like this: 0x78 0x56 0x34 0x12.

  • For those (like me) who are still operating largely at the dummies level: imagine 1776 being stored in memory as 6771.

Big-endian hardware does the reverse. In the memory of big-endian computers, the “big” end of a data chunk is stored leftmost. This means that a data chunk of 0x12345678 is stored as 4 bytes with the most significant byte to the left, like this: 0x12 0x34 0x56 0x78.

  • For us dummies: imagine 1776 being stored in memory as 1776.
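
To see both layouts side by side, here is a minimal Python sketch. It uses the built-in int.to_bytes method; the variable names are just for illustration.

```python
# The same logical value, laid out in memory both ways.
value = 0x12345678

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12
```

The two results contain exactly the same bytes; only their order differs.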

Here are some random (but curiously interesting) bits of information, courtesy of the Microsoft Support web-site article Explanation of Big Endian and Little Endian Architecture.

  • Intel computers are little endian.
  • Motorola computers are big endian.
  • RISC-based MIPS computers and the DEC Alpha computers are configurable for big endian or little endian.
  • Windows NT was designed around a little endian architecture, and runs only on little-endian computers or computers running in little-endian mode.

In summary, the byte order — the order of the bytes in multi-byte chunks of data — is different on big-endian and little-endian computers.
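
As a rough illustration of why this matters, the Python sketch below stores 1776 in big-endian order and then reads the same two bytes back under each assumption. The variable names are hypothetical; sys.byteorder reports which kind of machine you happen to be running on.

```python
import sys

# "little" on Intel-style hardware, "big" on big-endian hardware.
print(sys.byteorder)

# Store 1776 in two bytes, big end first: 0x06 0xF0.
raw = (1776).to_bytes(2, byteorder="big")

right = int.from_bytes(raw, byteorder="big")     # reader agrees with writer
wrong = int.from_bytes(raw, byteorder="little")  # reader assumes the opposite order

print(right)  # 1776
print(wrong)  # 61446
```

Nothing in the two bytes themselves says which reading is the intended one. That has to come from somewhere else.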

Which brings us to…

The Unicode Byte Order Mark

In this section, I’m going shamelessly to rip off information from Jukka K. Korpela’s outstanding Unicode Explained from O’Reilly (see the section on Byte Order starting on page 300). (See also Jukka’s valuable web page on characters and encodings.)

Suppose you’re running a big-endian computer, and create a file in Unicode’s UTF-16 (two-byte) format.

Note that the encoding is Unicode’s UTF-16 encoding, which works in two-byte units, not UTF-8, which works in one-byte units. That’s an important aspect of the problem, as you will see.

You send the file out into the world, and it is downloaded by somebody running a little-endian computer. The recipient knows that the file is in UTF-16 encoding. But the bytes are not in the order that he (with his little-endian computer) expects. The data in the file appears to be scrambled beyond recognition.
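
Here is a small Python sketch of that scrambling. It assumes a three-letter sample text, and uses the standard "utf-16-be" and "utf-16-le" codecs to play the roles of the big-endian sender and the little-endian recipient.

```python
text = "BOM"

# The sender writes UTF-16 with the big end first: 00 42 00 4F 00 4D.
data = text.encode("utf-16-be")

# The recipient assumes the little end comes first and swaps every pair.
misread = data.decode("utf-16-le")

print(data.hex(" "))               # 00 42 00 4f 00 4d
print(misread)                     # three unrelated CJK characters
print(data.decode("utf-16-be"))    # BOM
```

Note that the misreading raises no error. The swapped bytes happen to be valid code points, so the recipient just gets the wrong text.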

The solution, of course, is simply to tell the recipient that the file was encoded in UTF-16 on a big-endian computer. Ideally, we’d like for the data in the file itself to be able to tell the recipient the byte order (big endian or little endian) that was used when the data was encoded and stored in the file.

This is exactly what the Unicode byte order mark (BOM) is designed to do.

Unicode reserves the code point U+FEFF for use as a byte order mark. Written by a big-endian encoder, it appears in the file as the bytes 0xFE 0xFF; written by a little-endian encoder, it appears as 0xFF 0xFE. The byte-swapped value, U+FFFE, is a noncharacter that is guaranteed never to be assigned to a real character, so the two cases can’t be confused.

If the first two bytes of a file are 0xFE 0xFF or 0xFF 0xFE, then a Unicode decoder knows that those two bytes contain a Unicode BOM, and knows what to do with the BOM.

This also means that if you (in the role, say, of a forensic computer scientist) must process a mystery file, and you see that the file’s first two bytes contain one of the two Unicode BOMs, you can (with a high probability of being correct) infer that the file is encoded in Unicode UTF-16 format.
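
A minimal sketch of that kind of detective work in Python might look like this. The function name sniff_utf16_bom is made up for the example; the BOM constants come from the standard codecs module.

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Guess the UTF-16 byte order from the first two bytes, or return None."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    return None

# A file written on a big-endian machine, with a BOM in front.
sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

print(sniff_utf16_bom(sample))   # utf-16-be
print(sample.decode("utf-16"))   # hi  (the generic codec reads and drops the BOM)
```

This is only a guess, as the “high probability” above suggests. For example, a file that starts with FF FE 00 00 could also be UTF-32, and the sketch does not check for that.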

So: Where’s the BOM?

In actual practice, most UTF-8 files do not include a BOM.  Why not?

A file that has been encoded using UTF-16 is an ordered sequence of 2-byte chunks. Knowing the order of the bytes within the chunks is crucial to being able to decode the file into the correct Unicode code points.  So a BOM should be considered mandatory for files encoded using UTF-16.

But a file in UTF-8 encoding is an ordered sequence of 1-byte chunks.  In UTF-8, a byte and a chunk are essentially the same thing.  So with UTF-8, the problem of knowing the order of the bytes within the chunks is simply a non-issue, and a BOM is pointless. And since the Unicode standard does not require the use of the BOM, virtually nobody puts a BOM in files encoded using UTF-8.
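
If you do run into a UTF-8 file with a BOM, Python can cope with it. The sketch below assumes Python’s standard "utf-8" and "utf-8-sig" codecs; the variable names are illustrative.

```python
import codecs

plain = "hello".encode("utf-8")       # no BOM
signed = "hello".encode("utf-8-sig")  # the same bytes with EF BB BF in front

print(signed == codecs.BOM_UTF8 + plain)   # True

# The plain codec keeps the BOM as a stray U+FEFF character at the start...
print(repr(signed.decode("utf-8")))        # '\ufeffhello'

# ...while utf-8-sig drops it, and is also happy when there is no BOM at all.
print(signed.decode("utf-8-sig"))          # hello
print(plain.decode("utf-8-sig"))           # hello
```

So writing with "utf-8" and reading with "utf-8-sig" is one reasonable way to follow the no-BOM convention while still tolerating files that have one.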

Let’s do UTF-8… all the time!

It is important to recognize that UTF-8 is able to represent any character in the Unicode standard.  So there is a simple rule for coding English text (i.e. text that uses only or mostly ASCII characters) —

Always use UTF-8.

  • UTF-8 is easy to use. You don’t need a BOM.
  • UTF-8 can encode anything.
  • For English or mostly-ASCII text, there is essentially no storage penalty for using UTF-8. (Note, however, that if you’re encoding Chinese text, your mileage will differ!)
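
The storage point is easy to check for yourself. This sketch compares encoded sizes for two sample strings chosen for the example; encoded_sizes is a hypothetical helper, not a library function.

```python
def encoded_sizes(text: str):
    """Return (bytes in UTF-8, bytes in UTF-16 without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english = "Just use UTF-8"
chinese = "统一码"

print(encoded_sizes(english))  # (14, 28)
print(encoded_sizes(chinese))  # (9, 6)
```

For the English sample UTF-8 is half the size of UTF-16. For the Chinese sample it is half again as large, because each of these characters takes three bytes in UTF-8 and two in UTF-16.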

What’s not to like!!??

UTF-8? For every Unicode code point?!

How can you possibly encode every character in the entire Unicode character set using only 8 bits!!!!

Here’s where Joel Spolsky’s (Joel on Software) excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) comes in useful.  As Joel notes

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

This is the myth that Unicode is what is known as a Multibyte Character Set (MBCS) or Double-Byte Character Set (DBCS).   Hopefully, by now, this myth is dying.

In fact, UTF-8 is what is known variously as a

  • multibyte encoding
  • variable-width encoding
  • multi-octet encoding (For us dummies, octet == byte. For the difference, see page 46 of Korpela’s Unicode Explained.)

Here’s how multibyte encoding works in UTF-8.

  • ASCII characters are stored in single bytes.
  • Non-ASCII characters are stored in multiple bytes, in a “multibyte sequence”.
  • For non-ASCII characters, the first byte in a multibyte sequence is always in the range 0xC2 to 0xF4 (under the four-byte limit described below). The coding of the first byte indicates how many bytes follow, and so indicates the total number of bytes in the multibyte sequence.
  • In UTF-8, a multibyte sequence can contain as many as four bytes.
  • Originally a multibyte sequence could contain six bytes, but UTF-8 was restricted to four bytes by RFC 3629 in November 2003.
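
Here is a rough Python sketch of how the first byte gives away the length of the sequence. It assumes the four-byte limit of RFC 3629, under which valid first bytes of multibyte sequences fall between 0xC2 and 0xF4. The function name utf8_sequence_length is invented for this example.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total number of bytes in the sequence that starts with first_byte."""
    if first_byte <= 0x7F:
        return 1                      # plain ASCII, a single byte
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")

# Check the rule against what Python's own encoder produces.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(" "), utf8_sequence_length(encoded[0]))
```

The sample characters take 1, 2, 3 and 4 bytes respectively. Bytes from 0x80 to 0xBF are continuation bytes and can never start a sequence, so the sketch rejects them.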

For a quick overview of how this works at the bit level, take a look at the answer by dsimard to the question How does UTF-8 “variable-width encoding” work? on Stack Overflow.

Wrapping it all up

So that’s it. Our investigation of the BOM has led us to take a closer look at UTF-8 and multibyte encoding.

And that leads us to a nice place. For the most part, and certainly if you’re working with ASCII data, there is a simple rule.

Just use UTF-8 and forget about the BOM.

9 Responses to Unicode for dummies – just use UTF-8

  1. The “Just use UTF-8 and forget about the BOM” is a sound advice, one with which I couldn’t agree more. However, this leaves millions of MS Windows users on brittle ice. Which version of Windows supports correctly UTF-8 as the default “codepage” for the system (and accordingly the “chcp 65001” thing for the command prompt ecosystem)?

  2. flow says:

    you are perfectly right that character encoding schemes often feel, and act, like ‘brittle ice’. however, i want to promote the idea that especially in an environment like ms windows, with all the legacy stuff lying around, where several layers of the system apply differing heuristics to render text and rely on differing assumptions, trying to ‘max out on utf-8’ is the only thing that will work in the long run. this strategy means (eg in a web application) that when tests or actual usage reveal an encoding error, go through the entire tool stack (os, web server, database, everything) and re-affirm that yes, i want to use utf-8 here. make sure the web page comes out with a meta header stating utf-8, and that the http response header also states utf-8. make sure your source files are all in utf-8, and that your editor / ide also uses that. ah, and for filenames: [-a-z0-9_.], except for tested exceptions that work. this last one is mainly intended for when you want to package your sources in a *.tgz or *.zip and unpack it on another platform; this will often lead to funny results when you go beyond 7-bit us ascii.

  3. Paul says:

    It’s a shame that the BOM is frowned upon when a file is encoded UTF8, since it provides a useful signature check.

  4. Simon Hibbs says:

    Quote: These terms are derived from “Big End In” and “Little End In.”

    I’ve seen that theory posited before, but Danny Cohen explicitly refers to Gulliver in his article, so it’s pretty clear his usage of the terms derives from Swift and the Lilliputians’ dispute as to which end of an egg should be eaten first.

  5. Walter Dörwald says:

    Also see http://docs.python.org/library/codecs.html#encodings-and-unicode for an explanation of UTF-8 and the BOM

  6. costy says:

    Hello! I’m at work surfing around your blog from my new iphone 3gs! Just wanted to say I love reading through your blog and look forward to all your posts! Carry on the great work!

  7. Pim says:

    I agree with Paul that the BOM is a useful signature check for UTF-8 files.
    In fact, it’s more useful for UTF-8 than for UTF-16 or UTF-32. If a computer program has to guess what encoding a BOM-less file is in, checking the first four bytes will provide enough information in the majority of cases. But not for UTF-8! UTF-8 will look just like any 1-byte charset in the absence of a BOM.

  8. Aditya says:

    Nice article, lucid, concise and to the point. Though a small paragraph on why UTF-16 did come up, or why some people felt the need for UTF-16, would make “always using UTF-8” an even better “informed” decision. Juxtapose them, I feel it may help. Anyways it’s just a suggestion 🙂
