Misconceptions about Unicode and UTF-8/16/32

February 20, 2009

So everyone knows that UTF-8 encodes Unicode code points into eight bits, and UTF-16 encodes Unicode code points into 16 bits (or two bytes), right? I mean… "UTF-8" has the answer right there in the name, doesn’t it?

Nope.

unicode.org

The chart above corrects a common misconception about Unicode: the "8" in "UTF-8" doesn’t indicate how many bits a code point gets encoded into. The size of an encoded code point depends on two things: a) the code unit size, and b) the number of code units used. So the 8 in UTF-8 is the code unit size in bits, not the total number of bits used to encode a code point.
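To make the two factors concrete, here is a minimal Python sketch that reports how many code units, and how many total bits, one character takes in each encoding. The names `CODE_UNIT_BITS` and `encoded_size` are made up for this illustration, and the explicit little-endian codecs are used only so that Python doesn’t prepend a byte order mark to the output.

```python
# Code unit size in bits for each encoding form.
# The "-le" codecs pin the byte order so no byte order mark is added.
CODE_UNIT_BITS = {"utf-8": 8, "utf-16-le": 16, "utf-32-le": 32}


def encoded_size(char, encoding):
    """Return (number of code units, total bits) for one character."""
    unit_bits = CODE_UNIT_BITS[encoding]
    total_bits = len(char.encode(encoding)) * 8
    return total_bits // unit_bits, total_bits


if __name__ == "__main__":
    for char in ("A", "é", "€", "😀"):
        for encoding in CODE_UNIT_BITS:
            units, bits = encoded_size(char, encoding)
            print(f"U+{ord(char):04X} {encoding:<10} "
                  f"{units} code unit(s), {bits} bits")
```

For "😀" (U+1F600) this reports four code units in UTF-8, two in UTF-16, and one in UTF-32, yet all three come to 32 bits. The number in the name only tells you the size of each unit.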

unicode.org

As the chart indicates, UTF-8 can actually store a code point using between one and four bytes, that is, one to four 8-bit code units. I find it helpful to think of the code unit size as the "granularity level", or the "building block size" you have available to you. With UTF-16 the maximum is still four bytes, but because the code unit size is 16 bits, the minimum is two bytes: every code point takes either one or two code units. UTF-32 uses a single 32-bit code unit, so every code point takes exactly four bytes.
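If you want to check those byte counts yourself, the sketch below encodes the code points that sit on either side of each UTF-8 length boundary. The `BOUNDARIES` list and the `byte_lengths` helper are names chosen for this example; the code relies only on Python’s built-in codecs.

```python
# Code points on either side of each UTF-8 length boundary,
# plus the first and last code points in the Unicode range.
BOUNDARIES = [0x0000, 0x007F, 0x0080, 0x07FF,
              0x0800, 0xFFFF, 0x10000, 0x10FFFF]

# Explicit little-endian codecs avoid a byte order mark in the output.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def byte_lengths(code_point):
    """Return the encoded length in bytes for each encoding."""
    char = chr(code_point)
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    print(f"{'code point':<12}" + "".join(f"{enc:>11}" for enc in ENCODINGS))
    for cp in BOUNDARIES:
        lengths = byte_lengths(cp)
        print(f"U+{cp:06X}    " + "".join(f"{lengths[e]:>11}" for e in ENCODINGS))
```

The UTF-8 column steps through 1, 2, 3 and 4 bytes, the UTF-16 column only ever shows 2 or 4, and the UTF-32 column is always 4.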

[ If this interests you, please take a look at my full article on character mapping and encoding at: danielmiessler.com/study/encoding/ ]