Using Unicode

Unicode in Python: Working With Character Encodings Christopher Trudeau 04:15

Python 3 represents data as either str (text) or bytes (raw binary). In this lesson, you’ll practice with encode() and decode(), which allow you to convert between the two. You’ll also start to work with Unicode and learn the difference between an encoding and a code point. Unicode specifies code points for characters but not their encodings. There are several different ways of encoding Unicode. UTF-8 is the most common and is the default in Python 3.

00:00 In the previous lesson, I showed you how to use hexadecimal notation to represent binary data. In this lesson, I’m finally going to start talking about Unicode.

00:09 You may recall the ASCII standard has 128 code points. This is not enough for all human languages. This has been improved upon: Unicode has 1,114,112 possible code points. That’s 17 * 2^16, numbered 0 to hex 10FFFF.
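The arithmetic behind that range can be checked in a few lines. This is a minimal sketch, not part of the lesson itself; the variable names are made up for illustration. It relies on the fact that Unicode is organized as 17 planes of 2**16 code points each, so the count is 17 * 2**16 and the highest code point is one less than that.

```python
# Unicode is laid out as 17 planes of 2**16 code points each.
total_code_points = 17 * 2**16
highest_code_point = total_code_points - 1

print(total_code_points)        # 1114112
print(hex(highest_code_point))  # 0x10ffff

# chr() accepts any integer from 0 up to and including 0x10FFFF.
last_char = chr(highest_code_point)

# One past the end is rejected with a ValueError.
try:
    chr(highest_code_point + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(out_of_range_rejected)    # True
```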

00:31 That’s a bit of a lie. There are a couple of reserved spots inside of Unicode, so the full range isn’t actually available, but the number’s close enough between friends.

00:41 The first 128 code points in Unicode are ASCII, making it backward compatible. Unicode itself is not an encoding. Unicode really only specifies the map for the code points. UTF-8, the default encoding in Python 3, is the most common and popular of the encodings.
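To make the split between code points and encodings concrete, here is a small sketch using the built-in ord() and chr() functions. The choice of Latin-1 as a second encoding is an assumption made for contrast and is not mentioned in the lesson.

```python
# ord() returns a character's code point; chr() goes the other way.
code_point_a = ord("a")
print(code_point_a)  # 97
print(chr(97))       # a

# For the first 128 code points, ASCII and UTF-8 produce identical bytes.
ascii_text = "".join(chr(i) for i in range(128))
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)    # True

# A code point is only a number. The bytes depend on the encoding chosen.
e_acute = "\u00e9"   # the letter é
print(ord(e_acute))               # 233
print(e_acute.encode("utf-8"))    # b'\xc3\xa9'
print(e_acute.encode("latin-1"))  # b'\xe9'
```

The same code point, 233, comes out as two bytes in UTF-8 and one byte in Latin-1, which is why Unicode alone does not tell you what ends up on disk.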

01:00 There’s also UTF-16, -32, and actually several others.
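As a rough comparison of these encodings, the sketch below encodes the same four-character string three ways. The explicit little-endian codec names ("utf-16-le", "utf-32-le") are chosen here so that no byte order mark is added; plain "utf-16" and "utf-32" would prepend one and change the lengths.

```python
text = "caf\u00e9"  # café, four characters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # 2 bytes per character for this text
utf32 = text.encode("utf-32-le")  # always 4 bytes per character

print(len(text))                          # 4
print(len(utf8), len(utf16), len(utf32))  # 5 8 16

# All three decode back to the same string.
round_trips = (
    utf8.decode("utf-8") == text
    and utf16.decode("utf-16-le") == text
    and utf32.decode("utf-32-le") == text
)
print(round_trips)                        # True
```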

01:05 Python 3 has two representations of data: str (string) and bytes. bytes are straight binary, str are text.

01:13 The concept of encoding and decoding in Python is the process of moving between these two representations. Let’s start off with a sample encoding. Here’s good old 'hello' encoded into 'utf-8', and you get back the binary representation b'hello'. For ASCII, it’s pretty simple—there’s not much change.
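The 'hello' example from the video looks roughly like this when run as a script rather than in the REPL. The variable names are illustrative only.

```python
text = "hello"
data = text.encode("utf-8")

print(type(text))  # <class 'str'>
print(type(data))  # <class 'bytes'>
print(data)        # b'hello'

# A bytes object is a sequence of integers from 0 to 255.
print(list(data))  # [104, 101, 108, 108, 111]

# With no argument, encode() uses UTF-8.
print(text.encode() == data)  # True
```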

01:32 Let’s look at something a little more complicated: 'café', encoded. 'caf' is still ASCII, but 'é', l’accent aigu, is not inside the ASCII table.

01:43 UTF-8 represents this letter using the two bytes 0xc3 0xa9 (c3 a9 hex). To highlight that, let’s look at it on its own. Same result: Charlie 3 Alpha 9. The UTF-8 encoding is actually variable-length.
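Here is a sketch of the 'café' example. The é is written with the escape \u00e9 so the snippet does not depend on how the source file itself is saved, and the final ASCII attempt is an addition meant to show what "not inside the ASCII table" means in practice.

```python
word = "caf\u00e9"  # café
encoded = word.encode("utf-8")

print(encoded)           # b'caf\xc3\xa9'
print(encoded.hex(" "))  # 63 61 66 c3 a9  (hex separator needs Python 3.8+)

# The accented letter on its own gives the same two bytes.
print("\u00e9".encode("utf-8"))  # b'\xc3\xa9'

# ASCII has no code point for é, so encoding to ASCII fails.
try:
    word.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(ascii_failed)      # True
```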

02:01 Anything in the ASCII table is represented by a single byte. The 'é', in 'café', is 2 bytes.

02:10 The Euro symbol is actually 3 bytes encoded. UTF-8 can go all the way up to 4 bytes encoded.
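The variable-length behaviour can be seen by encoding one character from each size class. This is a sketch under one assumption: the 4-byte example, U+1F600 (an emoji), is chosen here for illustration and does not appear in the lesson.

```python
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, é, €, and an emoji

byte_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# U+0061 -> 1 byte(s): 61
# U+00E9 -> 2 byte(s): c3 a9
# U+20AC -> 3 byte(s): e2 82 ac
# U+1F600 -> 4 byte(s): f0 9f 98 80
```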

02:18 The .decode() method does this backwards. So, starting with b'caf\xc3\xa9' and calling .decode(), you get back to the original string.
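A short round-trip sketch of decoding follows. The last step, decoding the same bytes as Latin-1, is an addition not covered in the video; it is included to show that decode() needs the encoding that actually produced the bytes.

```python
data = b"caf\xc3\xa9"

# Decoding with the matching encoding restores the original text.
text = data.decode("utf-8")
print(text)                          # café
print(text == "caf\u00e9")           # True
print(text.encode("utf-8") == data)  # True

# Decoding with a different encoding does not fail here,
# but each byte is read as its own character.
mojibake = data.decode("latin-1")
print(mojibake)                      # cafÃ©
print(len(text), len(mojibake))      # 4 5
```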