Combining Characters

Unicode in Python: Working With Character Encodings Christopher Trudeau 05:40

In this lesson, you’ll learn about digraphs, ligatures, and accent symbols. The Unicode standard allows for the combination of characters into a single glyph. This is used for applying accents to characters and changing the skin tone of an emoji. Some kinds of common digraphs, or letter combinations, have their own code point. Others require the combination of two code points, with one dedicated to the accent symbol.

00:00 In the previous lesson, I took you deep into the guts of how UTF-8 encodes at the binary level. In this lesson, I’m going to be talking about digraphs and other ways of combining characters in Unicode. So, what’s a digraph?

00:12 A digraph is a pair of letters that make a single sound. This is similar to a ligature, which is the name for the actual combination of the glyphs to make the character.

00:20 Old English had a bunch of these inherited from Latin—you may have seen them before. The word archeology originally contained the grapheme “ash”, which is the combination of the a and the e. Some ligatures like these are single code points in Unicode. Others are combinations.

00:37 The 'æ' “ash” combination is a single character. It has a code point of E6. Hindi and many of the other languages on the Indian subcontinent use Devanagari as their script.

00:50 The symbol for “ni” is one of these kinds of ligatures. It is a combination of two code points: 928 and 93F. You can see this in practice inside the REPL.

01:01 Starting with the code point for 'æ', this is a single code point with a single letter. Next is the symbol for the Devanagari sound “na”, which is part of the combined symbol for the sound “ni”. The 928 code point can be used on its own.
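The REPL session described here can be approximated with the short sketch below. It is not the exact code from the video. The variable names are illustrative, and it uses only the standard library's `unicodedata` module to look up the official name of each code point.

```python
import unicodedata

ash = "\u00e6"       # 'æ', a single code point (E6)
na = "\u0928"        # Devanagari "na", usable on its own (928)
ni = "\u0928\u093f"  # "ni": "na" followed by code point 93F

# One glyph on screen does not always mean one code point in the string.
print(ash, len(ash), unicodedata.name(ash))
print(na, len(na), unicodedata.name(na))
print(ni, len(ni), [f"U+{ord(ch):04X}" for ch in ni])
```

`len()` counts code points, so it reports 1 for 'æ' and 2 for “ni” even though each is drawn as a single symbol.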

01:19 The 93F code point cannot be used on its own. If you examine just this code point, you’ll notice that there’s a little dotted circle.

01:28 That dotted circle is the placeholder for the character that it’s being combined with. This ability to make combinations is also used to adjust emojis. The original emojis were all Simpsons-esque yellow.
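One way to tell a standalone letter from a combining mark in code is to look at its Unicode general category. The sketch below was not shown in the video. It assumes that checking for categories starting with `M` (the mark categories) is enough for this purpose, and `is_mark` is a hypothetical helper name.

```python
import unicodedata

na = "\u0928"            # Devanagari "na"
vowel_sign_i = "\u093f"  # the mark that renders with a dotted circle


def is_mark(ch: str) -> bool:
    """Return True if ch is in one of the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


print(unicodedata.category(na), is_mark(na))
print(unicodedata.category(vowel_sign_i), is_mark(vowel_sign_i))
```

Here “na” is reported as a letter, while 93F is reported as a mark, which is why it needs a base character to attach to.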

01:42 By adding a modifier code point after an emoji, you can change its skin tone. 1F3FB is the lightest tone possible.

02:00 By working your way up, you can continue to make the skin tone darker until you reach 1F3FF. This allows our Vulcan salute to be multicultural.
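As a rough illustration, the loop below appends each of the five modifiers from 1F3FB through 1F3FF to the Vulcan salute emoji. The transcript does not give the code point for the salute itself. This sketch assumes it is U+1F596, and whether the result displays as a single tinted glyph depends on your terminal and font.

```python
VULCAN_SALUTE = "\U0001F596"  # assumed code point, not stated in the transcript

# The five skin tone modifiers, from lightest (1F3FB) to darkest (1F3FF).
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

# Each toned emoji is the base emoji followed by one modifier code point.
toned = [VULCAN_SALUTE + modifier for modifier in SKIN_TONE_MODIFIERS]

for emoji in toned:
    print(emoji, len(emoji), [f"U+{ord(ch):X}" for ch in emoji])
```

Each resulting string has a length of 2, because the modifier is a separate code point that follows the base emoji.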

02:12 A quick note on terminology. Although for this course I’ve strictly defined what a character is, using that word amongst non-programmers is going to cause some confusion.

02:22 If you speak to somebody like a graphic designer or a typographer, they may use the word character loosely to mean the symbol that shows up on the screen. When you’re combining characters in ligatures and using digraphs, this may not actually represent a single code point. As a word of caution, be careful when you’re talking to somebody about how this works.
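To make the gap between “what shows up on the screen” and “code points” concrete, here is a small sketch using an accented letter. It was not part of the video. The choice of 'é' is just an example, and `unicodedata.normalize` is one standard-library way to convert between the combined and single-code-point forms.

```python
import unicodedata

composed = "\u00e9"     # 'é' stored as one code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# Both look like the same character to a reader, but the strings differ.
print(composed, len(composed))
print(decomposed, len(decomposed))
print(composed == decomposed)

# NFC normalization composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A designer would call both of these one character, while Python sees one code point in the first string and two in the second.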