Encoding UTF-8

In this lesson, you’ll learn about a crucial feature of UTF-8 called variable-length encoding. The UTF-8 encoding of Unicode doesn’t just store the code point. You’ll write a hex-based code point viewer to help visualize this. Using this function, you’ll try different characters and see the 1-, 2-, 3- and 4-byte encodings that UTF-8 uses.

00:00 In the previous lesson, I showed you how .encode() and .decode() work in Python to move from strings to bytes and back. In this lesson, I’m going to drill down on UTF-8 and how it actually stores the content.

00:14 Remember that Unicode specifies the code point, whereas UTF-8 is an encoding that stores those values. Python has two escape sequences you can use to get at the Unicode code points: "\u" and capital "\U".

00:27 Small "\u" is used for 4-digit hexadecimal code points, and capital "\U" is for 8-digit hexadecimal code points. The purpose of this lesson is to satisfy your curiosity about UTF-8. Generally, you don’t need to understand its inner workings to use UTF-8 and Unicode successfully in Python.
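As a quick illustration of the two escapes, here is a minimal sketch. The variable names are only for this example, and the characters chosen are the ones used later in the lesson.

```python
# "\u" takes exactly 4 hex digits; "\U" takes exactly 8 hex digits.
e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
snake = "\U0001F40D"     # SNAKE, too large for the 4-digit form

# The 8-digit form can also describe small code points.
same_e = "\U000000e9"

print(e_acute, snake)
print(e_acute == same_e)   # True
print(hex(ord(snake)))     # 0x1f40d
```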

00:48 Now that you’re familiar with hex, I’ve rewritten the function that shows the code points, this time displaying them in hex. I’ve put this inside of a file called points.py.

01:01 I can import this function, and then write a string… and look at its code points. The 'c' in 'café' is hex 63, the 'a' is hex 61, the 'f' is hex 66, and 'é' accent aigu is e9.

01:20 Notice that these are the code points, not how UTF-8 stores them.
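The transcript doesn’t show the contents of points.py, so the following is only a guess at what such a viewer might look like. The function name hex_points is hypothetical.

```python
def hex_points(text):
    """Return the code point of each character as a hex string.

    Hypothetical stand-in for the function in points.py.
    """
    return [format(ord(char), "x") for char in text]


# "caf\u00e9" is 'café' spelled with an escape for the last letter.
for char, point in zip("caf\u00e9", hex_points("caf\u00e9")):
    print(char, point)
```

This reports code points only. Nothing has been encoded to bytes yet.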

01:27 You can use the "\u" to get those letters back out. So, I can replace place with 'caf\u00E9' and get back the exact same string.

01:43 See? No different.

01:49 Letter by letter, same thing.

01:55 Now, when you encode this, notice that the code point E9 turns into 0xc3 0xa9 (c3 a9), which is 2 bytes.
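A short sketch of that encoding step, assuming Python 3.8 or later for the separator argument to bytes.hex():

```python
word = "caf\u00e9"                 # 'café'
encoded = word.encode("utf-8")

print(encoded)                     # b'caf\xc3\xa9'
print(encoded.hex(" "))            # 63 61 66 c3 a9
print(len(word), len(encoded))     # 4 5
```

Four characters become five bytes because only the 'é' needs a second byte.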

02:07 The double dagger symbol is a much larger code point number in Unicode.

02:15 Encoding it turns it into 3 bytes worth of information.

02:23 The snake is close to the upper end of the table—

02:28 you need a full 8 digits to describe the code point.

02:34 Encoding that turns it into 4 bytes of UTF-8. You’ve gone from single letters in ASCII, which are stored in a single byte, to characters just beyond ASCII such as 'é', which are stored in 2 bytes, to higher-level characters in 3, and then to things like the snake symbol way up at the top of the table, requiring a full 4 bytes of UTF-8.
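The four lengths can be seen side by side in a small sketch. The sample characters are the ones from this lesson, and the list names are illustrative.

```python
# 'a', 'é', the double dagger, and the snake.
samples = ["a", "\u00e9", "\u2021", "\U0001F40D"]

byte_counts = []
for char in samples:
    data = char.encode("utf-8")
    byte_counts.append(len(data))
    print(f"U+{ord(char):04X} -> {len(data)} byte(s): {data.hex(' ')}")

# U+0061 -> 1 byte(s): 61
# U+00E9 -> 2 byte(s): c3 a9
# U+2021 -> 3 byte(s): e2 80 a1
# U+1F40D -> 4 byte(s): f0 9f 90 8d
```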

02:58 So, I think I’ve established that UTF-8 is an encoding and not just the Unicode code point number. It’s variable-length and can be 1, 2, 3, or 4 bytes long.

03:08 How does the system understand what a character is comprised of? How does it know how many bytes are in this UTF-8 character? Well, the secret is in the encoding.

03:19 The first few bits of the first byte of each encoded character indicate how long the sequence is. If it’s 1 byte, the leading bit is 0.

03:30 The remaining bits are the actual encoding. This corresponds perfectly to 7-bit ASCII. For 2-byte encodings, the leading bits are 110. The remaining chunk, then, is part of the encoding. Back to our 'é' from 'café': C3 starts with 110, so you can see by looking at the first byte that this is going to be a 2-byte encoding. 3-byte encodings start with 1110.

03:58 Similarly for the double dagger: its first encoded byte is E2, and E2 starts with 1110. And finally, 4-byte encodings start with four 1’s and a 0. The pattern holds.
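The leading-bit patterns above can be checked with bit masks. This is a minimal sketch rather than a full UTF-8 validator, and the function name sequence_length is made up for the example.

```python
def sequence_length(first_byte):
    """Return how many bytes a UTF-8 character uses, from its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx
        return 4
    raise ValueError(f"{first_byte:#04x} is not a leading byte")


for first in (0x63, 0xC3, 0xE2, 0xF0):
    print(f"{first:02x} {first:08b} -> {sequence_length(first)} byte(s)")
```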

04:12 So, what about the rest of the bytes? Well, if you’re in a 2-, 3- or 4-byte encoding, the 2nd, 3rd, and 4th bytes all start with 10. This is important.

04:23 This feature is called self punctuating, and is more commonly known as self-synchronizing. This means you can look at any byte in a UTF-8 stream and know whether it’s a leading byte or a subsequent byte. No leading byte starts with 10.
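Here is a sketch of how that property lets a reader recover after landing in the middle of a character. The helper names is_subsequent and resync are hypothetical.

```python
def is_subsequent(byte):
    """True if the byte starts with the bits 10 (a non-leading byte)."""
    return byte & 0b1100_0000 == 0b1000_0000


def resync(data, start):
    """Return the index of the first leading byte at or after start."""
    index = start
    while index < len(data) and is_subsequent(data[index]):
        index += 1
    return index


# 'café' followed by the double dagger: 63 61 66 c3 a9 e2 80 a1
data = "caf\u00e9\u2021".encode("utf-8")

# Index 4 is a9, the second byte of 'é', so skip ahead to e2 at index 5.
start = resync(data, 4)
print(start)                          # 5
print(data[start:].decode("utf-8"))   # ‡
```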

04:35 This allows you to pick up partway through a stream and know when the next character starts. To see this in action, let’s look back at the 'é' from 'café'.

04:46 Remember? That’s code point E9. E9 is greater than 7F. 7F is 127 in decimal. This means it’s going to have to be a multi-byte encoding, so it won’t start with 0, like an ASCII one. To start to break this down, let’s look at E9 as a binary number.

05:06 Using our digits of hex trick from before, translate the E into 1110 and the 9 into 1001. Because the number in the code point is bigger than 127, you know that it’s going to be a multi-byte encoding.
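The same hex-to-binary translation can be done with format(), shown here as a small sketch with illustrative variable names:

```python
code_point = 0xE9

# Each hex digit maps to 4 bits.
high_digit = format(0xE, "04b")
low_digit = format(0x9, "04b")
print(high_digit, low_digit)          # 1110 1001

all_bits = format(code_point, "08b")
print(all_bits)                       # 11101001

# Anything above 0x7F (127) cannot fit in a 1-byte encoding.
print(code_point > 0x7F)              # True
```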

05:21 Start on the right-hand side and peel off the last 6 bits. Because this is going to be a subsequent encoded byte, lead it with the 10 subsequent byte marker.

05:34 Next, take the next chunk of bits. Well, there are only 2 bits left, and because all the bits have been used up, you know you’re done, which means it’s going to be 2 bytes, so use the 2-byte marker.

05:45 Finally, fill in the middle with some padding. This is the end result of the encoding. The left-hand side turns into C3, the right-hand side into A9.

05:57 If you remember from the session in the REPL, the letter 'é', with the code point E9, encoded into \xc3\xa9 (c3 a9).
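The peel, mark, and pad steps can be reproduced with bit operations. This sketch covers only the 2-byte case (code points 0x80 through 0x7FF), and the function name encode_two_byte is invented for the example.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range 0x80..0x7FF as 2 UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not use a 2-byte encoding")

    low_six = code_point & 0b11_1111      # last 6 bits: 101001 for E9
    high_bits = code_point >> 6           # what is left: 11 for E9

    second = 0b1000_0000 | low_six        # 10 marker + 6 bits
    first = 0b1100_0000 | high_bits       # 110 marker + zero padding + bits
    return bytes([first, second])


manual = encode_two_byte(0xE9)
print(manual.hex(" "))                        # c3 a9
print(manual == "\u00e9".encode("utf-8"))     # True
```

The zero padding falls out of the OR operation: the 2 leftover bits land in the lowest positions of the first byte, and the bits between them and the 110 marker stay 0.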

06:06 So, this is how UTF-8 represents its information. Believe it or not, that was the easy part. It gets worse from here on in. Next up: digraphs and dirty tricks.