Unicode, UTF-8 explained with examples using Go
Unicode and UTF-8 are two topics that I’ve always had trouble wrapping my head around. Although I could memorize that the encoding scheme should be set to UTF-8 when writing to and reading from files, it always seemed like a black box. And whenever I tried to read up on it, the material was either way too brief or way too complicated. I hope this article will be a good stepping stone for anyone who wants a working knowledge of text encoding.
I have divided this article into the following subtopics.
- ASCII standard
- Other ASCII-based schemes and the problems they presented
- The need for Unicode
- Unicode standard
- UTF-8, UTF-16 and UTF-32
- UTF-8 explained further
- Examples in Go
ASCII standard
In short, character encoding refers to the process of converting characters into binary representations that the computer can understand. The mapping that defines how this is done is referred to as an encoding scheme.
The grandfather of these encoding schemes is ASCII (American Standard Code for Information Interchange), a 7-bit encoding scheme developed during the 1960s and popularized through teleprinter systems. At that time, programs only required lower-case and upper-case English letters along with numbers, punctuation, control characters and a few other special characters. Since this is a 7-bit mapping, only 2⁷ = 128 characters are supported by it.
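To make this concrete, here is a quick sketch in Go (the language used for the examples later in this article) that prints a character's ASCII code point in decimal and in its 7-bit binary form:

```go
package main

import "fmt"

func main() {
	c := 'A'
	// 'A' maps to 65 in ASCII, which fits comfortably in 7 bits.
	fmt.Printf("%c = %d = %07b\n", c, c, c) // A = 65 = 1000001
}
```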
Other ASCII-based schemes
Since most systems at that time used 8 bits to store a character, one bit always went to waste. In other words, 2⁸ − 2⁷ = 128 more characters could have been mapped to something.
Hence, many companies came up with their own encoding schemes based on ASCII, sometimes known as extended ASCII, each covering 256 different characters. For example, the Latin-1 encoding scheme extends ASCII to support most western European languages. Windows used an encoding scheme known as CP-1252 (commonly called ANSI), while Mac OS used a scheme known as Mac OS Roman.
If you compare those two character sets, you can see that they share common characters. However, these characters are encoded differently in each scheme.
Meanwhile, countries with vast character repertoires, like Japan, China and Korea, came up with their own character sets, which were not based on ASCII at all.
The need for Unicode
Now let’s understand why we needed a universal character encoding. Imagine you write a text file on a Windows system using the CP-1252 encoding scheme, and it includes a few accented European characters. Now say you read this file on a Mac OS system using its default encoding scheme, Mac OS Roman. The text would come out wrong, since the non-English characters would be mapped to entirely different characters. This example may have seemed a little far-fetched at the time, and there were ways to get around it by remembering to read and write using a common scheme.
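We can actually demonstrate this mismatch in Go. The sketch below (assuming the golang.org/x/text/encoding/charmap package) decodes the same raw byte, 0xC9, with both schemes and gets two different characters:

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	raw := []byte{0xC9} // a bare byte carries no information about its encoding

	win, _ := charmap.Windows1252.NewDecoder().Bytes(raw)
	mac, _ := charmap.Macintosh.NewDecoder().Bytes(raw)

	fmt.Printf("CP-1252:      %s\n", win) // É
	fmt.Printf("Mac OS Roman: %s\n", mac) // …
}
```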
However, during the 90s, the Internet started gaining traction, and suddenly documents written in different schemes all over the world were being shared everywhere. This heavily emphasized the need for a unified standard.
Unicode standard
In the early 90s, the Unicode standard was born, providing a consistent encoding, representation and handling of text expressed in most of the world’s writing systems. The standard is maintained by an organization known as the Unicode Consortium, and as of March 2020, Unicode 13.0 defines a repertoire of 143,859 characters. These consist of 143,696 graphic characters and 163 format characters covering 154 modern and historic scripts, as well as multiple symbol sets and emojis.
Unicode uses a simple unsigned (positive) integer based mapping. Each character is given an integer value ranging from 0 to 1,114,111. These integers are known as code points. The first few characters of the Unicode standard are the same as in ASCII (e.g. A is 65), and each newly added character is given the next code point value. Code points of a few characters from the Hindi (Devanagari) script are given below.
ऒ = 2322, ओ = 2323, औ = 2324, क = 2325
Typically, the characters of a non-English script are assigned code points following that script's alphabetical order.
UTF-8, UTF-16 and UTF-32
If there’s only one thing you take away from this article, it should be this:
Unicode doesn’t specify how the code points should be encoded (converted to bits). It is independent of any binary representation; code points are simply numbers.
It simply assigns a code point to each character. How these code points are saved in memory or written to disk is entirely up to the end user. This is why there are multiple encoding schemes for Unicode. Let’s look at UTF-32 first, as it’s the most intuitive.
UTF-32
UTF-32 simply uses 4 bytes to save each character, which means that 2³² = 4,294,967,296 Unicode characters (think of characters as code points) can be encoded. However, the drawback of this scheme is that every character takes 4 bytes of memory. This can be a big waste, since a document containing only English characters encoded with UTF-32 would take up to 4 times more space than the same document encoded with ASCII, where each character occupies only 1 byte. So, let’s see if we can do better.
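In Go terms, a []rune slice is effectively an in-memory UTF-32 representation, since each rune is a 4-byte int32 code point. A quick sketch of the space overhead:

```go
package main

import "fmt"

func main() {
	s := "hello"
	runes := []rune(s) // one 4-byte int32 per code point, like UTF-32

	fmt.Println(len(s))       // 5 bytes as UTF-8 (all ASCII)
	fmt.Println(len(runes)*4) // 20 bytes if stored as UTF-32
}
```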
UTF-16
UTF-16 uses 2 bytes to save each character in the Unicode standard. For code points in the hexadecimal range U+0000 to U+FFFF, each code point is stored directly in 16 bits. Code points that do not fit in 16 bits (greater than or equal to 2¹⁶ = 65,536) are stored as two 16-bit units using a method called surrogate pairs, which I won't dive into since it's a complicated process. However, UTF-16 still requires 2 bytes to store ASCII characters, which leaves room for improvement.
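Go's standard library lets us peek at this. The sketch below uses the unicode/utf16 package to show that 'A' and 'Ѱ' each fit in a single 16-bit unit, while an emoji beyond U+FFFF needs a surrogate pair:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	units := utf16.Encode([]rune{'A', 'Ѱ', '😀'})
	// 'A' and 'Ѱ' take one unit each; '😀' (U+1F600) becomes the
	// surrogate pair d83d de00.
	for _, u := range units {
		fmt.Printf("%04x ", u)
	}
	fmt.Println() // 0041 0470 d83d de00
}
```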
UTF-8
UTF-8 is the last piece of the puzzle, and it is known as one of the best hacks of the Internet era. Even though the name suggests that UTF-8 might use just 1 byte (8 bits), this is not entirely true. UTF-8 is a variable-length encoding scheme.
For characters ranging from 0 to 127 (0x0 to 0x7F), it uses only 8 bits, and the first bit is left as 0. This is an exact representation of the ASCII standard. Therefore, UTF-8 is backward compatible with the ASCII standard, meaning that a document encoded as ASCII can be read as UTF-8 and vice versa, provided it only includes ASCII characters.
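A quick sketch of that compatibility: the UTF-8 bytes of a pure-ASCII string are exactly its ASCII codes, one byte per character.

```go
package main

import "fmt"

func main() {
	// Go strings are UTF-8; for ASCII text the bytes are the ASCII codes.
	fmt.Println([]byte("Go!")) // [71 111 33]
}
```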
For characters ranging from 128 to 2047 (0x80 to 0x7FF), it uses 2 bytes. This encoding is done in a special way. Let’s take the character Ѱ (code point 1136) as an example. It is represented in binary as below,
Ѱ = 1136 = U+0470 = 11010001 | 10110000
In the example above, the bits other than the fixed prefixes carry the actual binary value of the integer code point. In the first byte, the first 3 bits are always 110, marking the character as a 2-byte encoded Unicode code point, while in the second byte the first 2 bits are always 10, marking it as a continuation byte. So only 11 of the 16 bits are actually used to represent the code point.
A clear representation of this is given in the table below. You can also see how 3-byte and 4-byte characters are encoded.

| Code point range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000 to U+007F | 0xxxxxxx | | | |
| U+0080 to U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 to U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 to U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
For extra clarity, let’s pick a 3-byte character and see its binary representation. Let’s examine the Roman numeral Ⅲ’s UTF-8 representation.
Ⅲ = 8546 = U+2162 = 11100010 | 10000101 | 10100010
The only difference from the 2-byte case is that this time it uses 3 bytes: in the first byte, the first 4 bits are always 1110, and in the subsequent bytes, the first two bits are always 10. The rest of the bits are used for the actual binary representation of the code point.
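We can verify both of these encodings with a few lines of Go, by printing each UTF-8 byte of the two characters in binary:

```go
package main

import "fmt"

func main() {
	for _, s := range []string{"Ѱ", "Ⅲ"} {
		// Ranging over []byte(s) walks the raw UTF-8 bytes.
		for _, b := range []byte(s) {
			fmt.Printf("%08b ", b)
		}
		fmt.Println()
	}
	// Output:
	// 11010001 10110000
	// 11100010 10000101 10100010
}
```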
Examples in Go
Now that we have a basic understanding of Unicode and the various encoding schemes, let’s look at a few examples. I’ve used Go as the programming language since Go natively supports Unicode and UTF-8, most probably because Go’s co-creators Rob Pike and Ken Thompson also designed UTF-8. The Go playground is also a great way to get a bit of practice with Unicode, since it doesn’t require any setup.
Go playground: https://play.golang.org/
In Go, characters are represented by a data type called rune, which is an alias for the int32 data type. Without going too deep into Go specifics, you can declare a character variable with the syntax below.
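The original embedded snippet isn't reproduced here, but a minimal version looks like this:

```go
package main

import "fmt"

func main() {
	var r rune = 'A' // a rune literal uses single quotes
	psi := 'Ѱ'       // the type is inferred as rune

	// Runes are just int32 code points, so they print as numbers.
	fmt.Println(r, psi) // 65 1136
}
```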
Now, let’s look at how characters are represented as a string, as a byte slice and in binary. We can use the different formatting verbs that Go inherited from the C language.
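Again, the original snippet is an embedded gist; a sketch along the same lines:

```go
package main

import "fmt"

func main() {
	s := "aѰ"

	fmt.Printf("%s\n", s)         // as a string: aѰ
	fmt.Printf("%v\n", []byte(s)) // as a byte slice: [97 209 176]

	// Ranging over a string yields runes (code points), not bytes.
	for _, r := range s {
		fmt.Printf("%c = %d = %U = %b\n", r, r, r, r)
	}
	// a = 97 = U+0061 = 1100001
	// Ѱ = 1136 = U+0470 = 10001110000
}
```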
You can see how the byte slice grows for characters with higher code point values. Now let’s see how we can print a list of Sinhalese characters (my mother tongue) if we start from a base Sinhalese character, as in the sketch below.
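The original code is also an embedded gist; here is a minimal reconstruction based on the description that follows:

```go
package main

import "fmt"

func main() {
	base := 'අ'            // Sinhala letter Ayanna
	fmt.Println(int(base)) // its code point: 3461

	// Print the characters at the next 20 code points.
	for i := rune(1); i <= 20; i++ {
		fmt.Printf("%c ", base+i)
	}
	fmt.Println()
}
```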
The code first converts the character to its decimal representation and prints the code point value (3461). Then, using that as a base code point, it prints the characters for the next 20 code point values.
The full code is shared in the Go playground link above.
Conclusion
With 4 bytes, UTF-8 can use up to 21 bits to store any possible code point. This is enough for all the characters in existence now, and if more characters are added in the future, UTF-8 could be extended to 5 or 6 bytes per character. The elegance of this solution is that it avoids spending extra bytes on everyday ASCII characters while remaining backward compatible with ASCII. This makes UTF-8 very efficient for ASCII text, and fairly efficient for European, Middle Eastern and several other scripts.
UTF-8 has become the de facto standard for encoding Unicode text. It is the default encoding on many operating systems and the preferred encoding for the web and for many data formats such as XML, HTML, CSS and JSON.
I hope this article managed to shed some light on Unicode, UTF-8 and text encoding in general. If you want to dive further into these topics, I’d recommend the resources below.
- https://home.unicode.org/
- http://www.joelonsoftware.com/articles/Unicode.html
- https://blog.golang.org/strings
- https://www.youtube.com/watch?v=MijmeoH9LT4
- https://www.youtube.com/watch?v=I-pQH_krD0M
- https://www.youtube.com/watch?v=HhUuzFXdyNs&t=38s
Happy learning!!