Using Unicode

Python 3 represents data as either str (text) or bytes (raw binary). In this lesson, you’ll practice with encode() and decode(), which convert between the two. You’ll also start to work with Unicode and learn the difference between an encoding and a code point. Unicode specifies code points for characters but not how those code points are stored as bytes. There are several ways of encoding Unicode. UTF-8 is the most common and is the default in Python 3.

00:00 In the previous lesson, I showed you how to use hexadecimal notation to represent binary data. In this lesson, I’m finally going to start talking about Unicode.

00:09 You may recall the ASCII standard has 128 code points. This is not enough for all human languages. Unicode improves on this: it has 1,114,112 possible code points. That’s 17 * 2^16, numbered from 0 to hex 10FFFF (which is 17 * 2^16 - 1).
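As a quick check of that arithmetic, here is a minimal sketch. The variable names are only illustrative, and `sys.maxunicode` is the interpreter's own report of the highest code point it supports.

```python
import sys

# Unicode is laid out as 17 planes of 2**16 code points each.
total_code_points = 17 * 2**16
highest_code_point = total_code_points - 1  # numbering starts at 0

print(total_code_points)                      # 1114112
print(hex(highest_code_point))                # 0x10ffff
print(sys.maxunicode == highest_code_point)   # True
```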

00:31 That’s a bit of a lie. There are a couple of reserved spots inside of Unicode, so the full range isn’t actually available, but the number’s close enough between friends.

00:41 The first 128 code points in Unicode are ASCII, making it backward compatible. Unicode itself is not an encoding. Unicode really only specifies the mapping from characters to code points. UTF-8, the default encoding in Python 3, is the most common and popular of the encodings.

01:00 There’s also UTF-16, UTF-32, and actually several others.
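To make the difference between a code point and an encoding concrete, here is a small sketch that takes one character and encodes it three ways. The variable names are illustrative, and the little-endian codec names are an assumption chosen only so that no byte-order mark is added to the output.

```python
# One character has one code point but several possible byte encodings.
char = "\u00e9"  # é

code_point = ord(char)
print(hex(code_point))  # 0xe9

utf8 = char.encode("utf-8")
utf16 = char.encode("utf-16-le")
utf32 = char.encode("utf-32-le")

print(utf8)   # b'\xc3\xa9'
print(utf16)  # b'\xe9\x00'
print(utf32)  # b'\xe9\x00\x00\x00'
```

The code point stays the same in every case. Only the bytes used to store it change.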

01:05 Python 3 has two representations of data: str (string) and bytes. bytes is straight binary; str is text.

01:13 The concept of encoding and decoding in Python is the process of moving between these two representations. Let’s start off with a sample encoding. Here’s good old 'hello' encoded into 'utf-8', and you get back the binary representation b'hello'. For ASCII, it’s pretty simple—there’s not much change.

01:32 Let’s look at something a little more complicated: 'café', encoded. 'caf' is still ASCII, but 'é', with l’accent aigu, is not inside the ASCII table.

01:43 UTF-8 represents this letter using the two bytes 0xc3 0xa9 (c3 a9 hex). To highlight that, let’s look at it on its own. Same result: Charlie 3 Alpha 9. The UTF-8 encoding is actually variable-length.
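The encodings described so far can be reproduced with a short sketch like the one below. The variable names are made up for illustration; the strings are the ones used in the lesson.

```python
# Pure ASCII text: the UTF-8 bytes look the same as the text.
plain = "hello".encode("utf-8")
print(plain)  # b'hello'

# 'caf' stays ASCII, 'é' becomes the two bytes c3 a9.
accented = "café".encode("utf-8")
print(accented)  # b'caf\xc3\xa9'

# The accented letter on its own.
e_acute = "é".encode("utf-8")
print(e_acute)        # b'\xc3\xa9'
print(e_acute.hex())  # c3a9
```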

02:01 Anything in the ASCII table is represented by a single byte. The 'é', in 'café', is 2 bytes.

02:10 The Euro symbol is actually 3 bytes encoded. UTF-8 can go all the way up to 4 bytes encoded.
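A rough way to see the variable length is to count the bytes each character produces. In this sketch the grinning-face emoji is an assumption added as a 4-byte example (it does not appear in the lesson), and the dictionary keys are just labels.

```python
# Characters written as escapes so the code points are explicit.
samples = {
    "a": "a",                       # U+0061, in the ASCII table
    "e-acute": "\u00e9",            # U+00E9
    "euro": "\u20ac",               # U+20AC
    "grinning face": "\U0001F600",  # U+1F600, assumed 4-byte example
}

# Number of bytes each character needs in UTF-8.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, length in byte_lengths.items():
    print(name, length)
# a 1
# e-acute 2
# euro 3
# grinning face 4
```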

02:18 The .decode() method does this backwards. Starting with b'caf\xc3\xa9' and calling .decode(), you get back the original string.

02:32 Notice that you’re feeding this .decode() method a binary representation and it’s returning the str result—the opposite of the .encode() method.
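Here is a minimal sketch of that round trip. The attempt to decode with the "ascii" codec is an extra illustration, not something shown in the lesson; it is included to show what happens when the bytes don't fit the codec.

```python
raw = b"caf\xc3\xa9"

# bytes in, str out.
text = raw.decode("utf-8")
print(text)                                      # café
print(type(raw).__name__, type(text).__name__)   # bytes str

# Encoding the result again gives back the original bytes.
round_trip_ok = text.encode("utf-8") == raw
print(round_trip_ok)  # True

# The bytes c3 a9 are outside the ASCII range, so this codec fails.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)  # True
```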

02:41 Later versions of Python 2 did support Unicode, but they did so by adding a layer on top of ASCII and it caused all sorts of complications. Python 3 defaults to Unicode and UTF-8.

02:51 This is actually a big win. I used to do a lot of Django programming in bilingual websites, and having to play with Python 2 and Unicode and getting all the accents right was problematic.

03:03 This being the default means all strings are Unicode and they can contain any Unicode character. Most Unicode is even valid for identifiers, so if I felt like embracing my French-Canadian heritage and putting the appropriate accents inside of résumé, I could. Generally, it’s not considered good practice because these characters aren’t always easy to type on most people’s keyboards, but it is now possible.
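As a small illustration, the sketch below uses an accented variable name. The value assigned to it is made up, and `str.isidentifier()` is used only to ask Python whether a given name would be accepted.

```python
# A non-ASCII identifier is legal in Python 3.
résumé = "ten years of Django"
print(résumé)  # ten years of Django

# Ask Python whether the name is a valid identifier.
accented_name_ok = "résumé".isidentifier()
print(accented_name_ok)  # True
```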

03:29 Not all characters in Unicode are valid identifiers. Unfortunately, you cannot use emojis inside your identifiers. The list of what is supported is long and covers most languages, but it isn’t 100% of Unicode. In addition to string manipulation being Unicode-based, so are regular expressions. And finally, the default encoding for str.encode() is "utf-8", so if you don’t specify it, it’ll be UTF-8.
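These three points can be checked with a short sketch. The sample text and variable names are assumptions for illustration, and the grinning-face emoji stands in for emojis in general.

```python
import re

# An emoji is not accepted as an identifier.
emoji_ok = "\U0001F600".isidentifier()
print(emoji_ok)  # False

# \w in a str pattern matches accented letters, not just ASCII ones.
words = re.findall(r"\w+", "café résumé")
print(words)  # ['café', 'résumé']

# Calling .encode() with no argument gives the same bytes as "utf-8".
default_is_utf8 = "café".encode() == "café".encode("utf-8")
print(default_is_utf8)  # True
```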

03:55 That being said, best practice is to always specify the encoding. This makes it easier for people who are switching back and forth between Python 2 and Python 3 not to get confused. Now that you’ve seen the basics behind Unicode and UTF-8, in the next lesson, I’m going to show you how UTF-8 actually works.
