Using Other Encodings

Unicode in Python: Working With Character Encodings Christopher Trudeau 04:45

In this lesson, you’ll go beyond UTF-8 and learn about other encodings in Python. There are multiple ways of specifying Unicode in a Python string. You’ll learn that not all characters can be represented in all of these formats. The complete list of accepted encodings is buried way down in the documentation for the codecs module, which is part of Python’s Standard Library.

00:00 In the previous lesson, I gave you a tour of useful built-in Python functions for manipulating text and code points. In this lesson, I’m going to talk about encodings other than UTF-8.

00:11 There are numerous ways of specifying Unicode inside of a Python string. You can type it from your keyboard or paste it from the clipboard. Any string can contain Unicode. You can use an octal escape, which is a backslash followed by up to 3 octal digits, or a hex escape, which is "\x" followed by exactly 2 hex digits.

00:32 You can use the character’s full name from the Unicode database with the "\N" escape, or you can use the small "\u" escape, which takes 4 hex digits (2 bytes), or the full-size capital "\U" escape, which takes 8 hex digits (4 bytes).

00:45 And now inside the REPL, I’ll prove that all those things are the same. The typed 'a', the octal, the hex, the database,

01:00 small "\u" escape, capital "\U" escape—

01:06 and look at that. They’re all equal. Not every one of these escape sequences can represent every Unicode character, though. An octal escape is at most 3 digits long. That gives it a maximum value of 511 in decimal, or code point 1FF.
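The REPL comparison described here can be reproduced with a short sketch like the one below. The variable names are only illustrative and are not taken from the lesson.

```python
# Six spellings of the same one-character string, lowercase "a" (U+0061).
typed = "a"                         # typed directly
octal = "\141"                      # octal escape: 0o141 == 97
hexed = "\x61"                      # hex escape: exactly 2 hex digits
named = "\N{LATIN SMALL LETTER A}"  # name from the Unicode database
small_u = "\u0061"                  # small \u escape: exactly 4 hex digits
big_u = "\U00000061"                # capital \U escape: exactly 8 hex digits

# Chained comparison: True only if every spelling is the same string.
all_equal = typed == octal == hexed == named == small_u == big_u
print(all_equal)
```

Every spelling produces a `str` of length 1, so the chained comparison evaluates to `True`.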

01:23 A hex escape is always 2 digits long. That gives it a maximum decimal value of 255, or code point FF. The small "\u" escape is 4 digits of hex.

01:36 This allows you to get up to decimal 65535, or code point FFFF, which actually isn’t a character. Unicode reserves that code point as a noncharacter.

01:46 This means the capital "\U" escape is the only one of these numeric formats that can specify all possible code points. In addition to UTF-8, Unicode defines the UTF-16 and UTF-32 encodings.
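The escape limits described above can be checked with a little arithmetic. This is a rough sketch: it computes the limits with `int()` rather than writing the escapes as literals, because recent Python versions warn about octal escapes above "\377" in string literals. The emoji code point U+1F600 is just an arbitrary example of a character above FFFF.

```python
import sys

# Largest value each fixed-width escape can express.
max_octal = int("777", 8)       # three octal digits
max_hex = int("ff", 16)         # two hex digits (\x)
max_small_u = int("ffff", 16)   # four hex digits (\u)
print(max_octal, max_hex, max_small_u)

# A code point above FFFF needs the 8-digit capital \U escape.
emoji = "\U0001F600"
print(ord(emoji) > max_small_u)

# The highest code point Python strings can hold.
print(hex(sys.maxunicode))
```

The first line of output is `511 255 65535`, and `sys.maxunicode` is `0x10ffff`, well beyond what the 4-digit "\u" escape can reach.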

01:57 Like UTF-8, UTF-16 is variable-length, but each character takes either 2 or 4 bytes. UTF-32 always uses 4 bytes per character. It’s important to note that these encodings are not compatible with each other.
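To make the size differences concrete, here is a small sketch that counts the encoded bytes for a few sample characters. The sample characters are arbitrary choices, and the little-endian codec names ("utf-16-le", "utf-32-le") are used so that no byte order mark is added to the counts.

```python
# Sample characters from different code point ranges (arbitrary choices).
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# Map each character to the number of bytes it needs in each encoding.
sizes_by_char = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, sizes in sizes_by_char.items():
    print(f"U+{ord(ch):04X}", sizes)
```

UTF-8 grows from 1 to 4 bytes across these samples, UTF-16 uses 2 bytes until the last character, which needs 4, and UTF-32 uses 4 bytes every time.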

02:12 Consider the following code.

02:16 Encoding text as "utf-8" and then decoding the resulting bytes as "utf-16" does not give you back the original string. Not every UTF-8 byte sequence is even valid UTF-16, so not only is it possible to get the wrong result, you may also get an exception.
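A minimal sketch of this kind of mismatch follows. It is not necessarily the code shown on screen: the sample strings and variable names are assumptions, and "utf-16-le" is used instead of plain "utf-16" so the result does not depend on the machine’s byte order.

```python
letters = "αβγδ"
raw = letters.encode("utf-8")          # 8 bytes, 2 per Greek letter

round_trip = raw.decode("utf-8")       # same codec both ways
mismatched = raw.decode("utf-16-le")   # same bytes read as 16-bit units

print(round_trip == letters)   # the original text comes back
print(mismatched == letters)   # different characters come out

# A single UTF-8 byte is only half of a UTF-16 code unit, so decoding fails.
decode_error = None
try:
    "a".encode("utf-8").decode("utf-16-le")
except UnicodeDecodeError as exc:
    decode_error = exc
    print("decode failed:", exc.reason)
```

The first comparison is `True` and the second is `False`: the 8 bytes decode without error as UTF-16, but into 4 unrelated characters. The final decode raises `UnicodeDecodeError` because one byte cannot form a complete UTF-16 code unit.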