Plain Matching and Class Matching

Christopher Trudeau

Regular Expressions and Building Regexes in Python Christopher Trudeau 11:15


00:00 In the previous lesson, I gave you a quick overview of what regular expressions are and how to use them. In this lesson, I’ll show you your first regular expression, showing you how to do plain character matching. Regexes can range in complexity.

00:15 The simplest concept is just a plain string match. This regular expression looks for the five letters "thing" inside of the text it’s being applied to. It’ll match anywhere where the word "thing" is. To be strictly correct, I shouldn’t use the word word. "thing" will match "thing" on the end of "something", as well as "thing" on its own.
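A minimal sketch of this plain match in Python, assuming the standard re module. The sample string is made up for illustration and is not from the lesson.

```python
import re

# Hypothetical sample text, not the text used in the lesson.
text = "something is not the same thing"

# re.findall returns every non-overlapping match of the pattern.
matches = re.findall(r"thing", text)
print(matches)  # ['thing', 'thing']
```

The first match is the tail of "something" and the second is the standalone word, since the pattern says nothing about what surrounds the five letters.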

00:39 There’s nothing in this plain text match that talks about the spacing around it, so it’s specifically only looking for those five letters. On the other end of the complexity spectrum, there’s how to match an email address header inside of RFC822.

00:54 The code I’m about to show you comes out of a Perl regular expression for a library that looks for these headers. It’s a little insane.

01:04 Don’t worry, no human being actually looks at this. It was created by composing several smaller, easier-to-read regular expressions together. Later on, I’ll show you where this came from. But for now, I’ll get you started with simpler matching. As I mentioned in the overview, regexes are a language unto themselves. In order to understand how to use them in Python, you’ll need to understand both regexes and how Python uses regexes.

01:30 So to get started, I’ll be teaching you only regexes. Now I’ll show you the Python variant of them, but I won’t be showing you any Python code for the first few lessons.

01:41 I’ll just be showing you the regexes. Later on, I’ll show you how to put those inside of your Python code and how to use the re (regular expression) module. First up, I’ll show you some plain text matching.

01:53 And then after that, I’ll introduce you to character classes—a way of choosing from a range of characters. To introduce you to the language of regular expressions, I’m going to be using an online tool called pythex.

02:06 There are several different websites out there that help you build and debug regular expressions. This is just one of them. In the further reading section, I’ll show you a few more. The tool is made up of three basic parts.

02:19 The first line is where the actual regular expression goes. Then there’s a text area for the string that you are testing the regular expression against. And then down at the bottom, there’s an area showing you the matches. Throughout these lessons I’m going to be using the same chunk of text, so you’ll be very familiar with it by the time you’re done.

02:38 The text is an email message from Wile E. Coyote to the Acme corporation complaining about their products. I’ve put it inside the test string window, and now I’ll put in a regular expression.

02:53 Starting with a straight text match, I’ve put in Super with a capital S. The test string is looked through and a result is shown. Because all this doesn’t fit on the screen at once, I’m going to shrink the test string. As the matched result shows the entire message and highlights where the matches are, I don’t need both on the screen at the same time, so I’ll just be leaving this minimized. When you look at the match results, you’ll see the word 'Super' is highlighted three times.

03:25 There you go! That’s your first regular expression—a plain text match. To see this in action again,

03:33 I change the word—and sure enough, a similar result. Now the word 'Outfit' is matched.

03:42 Third time’s the charm. Now I’m matching fail. Notice that like the regular expression "thing" that I spoke of earlier, this isn’t matching a word—it’s matching the text.

03:54 So, the 'fail' in 'failure' and 'failing' gets matched. If you don’t include information about the whitespace inside of the regex, you will get a straight text match, whether it’s a word on its own or part of a longer word.

04:10 How about if there’s no match? Let me put it in another string,

04:15 and the tool tells you there are no matches. A plain text match is useful, but not particularly flexible. Let me show you how to match a range of characters.
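The plain text matches described so far, including the no-match case, could be reproduced in Python roughly as follows. This is a sketch using the standard re module. The sample string and the word "anvil" are invented for illustration and are not taken from the email used in the lesson.

```python
import re

# Hypothetical sample text, not the email used in the lesson.
text = "Total failure: the Super Outfit kept failing."

# The pattern matches the letters wherever they occur, even inside longer words.
fail_matches = re.findall(r"fail", text)
print(fail_matches)   # ['fail', 'fail']

# Plain matching is case sensitive: "Super" is found, "super" is not.
upper_matches = re.findall(r"Super", text)
lower_matches = re.findall(r"super", text)
print(upper_matches)  # ['Super']
print(lower_matches)  # []

# When nothing matches, re.search returns None.
no_match = re.search(r"anvil", text)
print(no_match)       # None
```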

04:25 The letter t is a plain text match, as before. Square brackets ([]) indicate that I’m going to be choosing from a set of characters.

04:38 And here, I’m matching any vowel. So—you’re looking for a pattern that starts with the letter t and then one of the vowels. Let’s look at some of the matches: the 'tu' in 'Saturday', the 'ti' in 'Corporation', the 'te' in 'uttered'.

04:57 How about up here? What’s going on? Well, it’s hard to tell inside of this tool, but this is actually two matches—'ta' being the first one and 'te' being the second.

05:09 You need to be careful when you’re using a tool like pythex to know when you’re looking at a match and when you’re looking at multiple matches—it doesn’t make the distinction in the resulting screen.
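One way to tell adjacent matches apart is to look at their positions. The sketch below assumes the pattern being described is t[aeiou] (inferred from the narration, since the transcript does not show it) and uses a made-up word in which two matches sit side by side.

```python
import re

# Hypothetical sample word containing two back-to-back matches.
text = "imitate"

# finditer yields match objects, which carry start and end positions.
found = [(m.start(), m.end(), m.group()) for m in re.finditer(r"t[aeiou]", text)]
print(found)  # [(3, 5, 'ta'), (5, 7, 'te')]
```

The first match ends at index 5, exactly where the second begins, which is why a highlighter can make the two look like a single match.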

05:20 You can also match numbers.

05:23 The hyphen (-) in this expression says that it is matching a range. So this is looking for a single digit in the range 0 to 9.

05:33 So if you were looking for four digits in a row,

05:41 you put in four sets of square brackets with 0-9 inside. The match result shows '1095', '1949', '3990', '9230', and '1949' again.

05:54 Notice the serial number here. There are seven digits before the hyphen. Because it’s looking for four digits, it matches the first four, and the remaining three aren’t enough for another match. If I change the expression

06:07 to just match three, now you’ll get '923' as the first match and '041' as the second match. Ranges can also be letters.
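A sketch of the same behavior in Python. The serial number below is hypothetical: it only mirrors the shape described in the narration (seven digits before a hyphen) and is not copied from the lesson text.

```python
import re

# Hypothetical serial number: seven digits, then a hyphen.
text = "Serial: 9230417-B"

# Four digit classes in a row match four consecutive digits.
four = re.findall(r"[0-9][0-9][0-9][0-9]", text)
print(four)   # ['9230']

# With three classes, the seven digits yield two matches and one leftover digit.
three = re.findall(r"[0-9][0-9][0-9]", text)
print(three)  # ['923', '041']
```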

06:23 This is looking for the plain text match of colon (:) preceded by any lowercase letter of the alphabet. So in the matches, you get 'e:', 'n:', 'e:', 'm:', 'o:', 't:', et cetera.

06:37 Notice that the matching is actually case sensitive. There isn’t an example in this text, but if there were a capital letter followed by a colon, it would not match.
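The case sensitivity can be checked with a short sketch. The pattern [a-z]: is inferred from the narration, and the two-line sample text is invented so that it contains the capital-letter case the lesson text lacks.

```python
import re

# Hypothetical header-like text with one lowercase and one capital letter before a colon.
text = "From: coyote\nTO: acme"

# Only a lowercase letter followed by a colon matches.
lower = re.findall(r"[a-z]:", text)
print(lower)  # ['m:']
```

The "O:" in "TO:" is skipped because the range a-z covers lowercase letters only.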

06:48 You can also do ranges on capital letters. Here’s another expression. This matches a single capital letter followed by either a capital letter or a small letter. This finds 'RP'—two capital letters, 'St'—with a capital letter and a small letter.

07:05 'MIME' is showing two matches, 'MI' and 'ME'. Similar with 'ASCII'—'AS' and 'CI'.
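A sketch of the capital-letter example in Python, assuming the pattern is [A-Z][A-Za-z] as the narration suggests. The sample string is assembled from the words mentioned above rather than taken from the lesson text.

```python
import re

# Hypothetical sample built from the words mentioned in the narration.
text = "MIME ASCII St"

# One capital letter, then one letter of either case.
pairs = re.findall(r"[A-Z][A-Za-z]", text)
print(pairs)  # ['MI', 'ME', 'AS', 'CI', 'St']
```

The final "I" of "ASCII" is left over because it is followed by a space rather than a letter.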

07:13 You can also complement a match by using the caret symbol (^), Shift + 6.

07:20 This says, “Match any character that is not in the range [a-zA-Z],” so any number or special character will show up.

07:34 This expression combines the ideas. This is looking for any digit followed by an even digit, or more specifically, not an odd digit—not the digits 1, 3, 5, 7, 9. Matches include '10', '94', '34'—oh, and '1a'! Notice that it’s just not those digits, so letters are included. This is why tools like pythex are useful when you’re building your regular expressions.

08:04 You may not actually be building the expression that you think you are, and seeing it in some example text will show you whether or not you’ve built it properly.
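The surprise with '1a' is easy to reproduce. The sketch below assumes the patterns are [^a-zA-Z] and [0-9][^13579], inferred from the narration, and uses made-up sample strings.

```python
import re

# A negated class matches any single character that is NOT listed.
non_letters = re.findall(r"[^a-zA-Z]", "a1-b")
print(non_letters)  # ['1', '-']

# Hypothetical sample: a digit followed by anything that is not an odd digit.
text = "10 94 1a 35"
pairs = re.findall(r"[0-9][^13579]", text)
print(pairs)  # ['10', '94', '1a']
```

To restrict the second character to even digits only, a class such as [02468] would be the more direct choice.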

08:13 What if you actually want to match the hyphen symbol (-)?

08:17 You can see that here. If the - is the first character inside of the character set, it’s taken literally. So this is a - or the small letters [a-z], followed by a capital letter [A-Z].

08:33 This matches '-R' in the 'X-RP', '-S', '-V', and '-T'. At the moment in the text, there isn’t a small letter followed by a capital letter.
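A sketch of the literal hyphen in Python, assuming the pattern is [-a-z][A-Z]. Both sample strings are hypothetical. The second one adds a small letter followed by a capital letter to show that case as well.

```python
import re

# A leading hyphen in the class is literal; a-z is still a range.
pattern = r"[-a-z][A-Z]"

# Hypothetical header-like text with hyphens before capital letters.
hyphens = re.findall(pattern, "X-RP-Serial")
print(hyphens)  # ['-R', '-S']

# Hypothetical text that also has a small letter before a capital letter.
mixed = re.findall(pattern, "X-RP nE")
print(mixed)    # ['-R', 'nE']
```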

08:46 If I change the text to include that,

08:52 now you’ll get the 'nE' matching as well. A similar idea is used if you’re trying to match the caret symbol (^). If the ^ isn’t the first symbol inside of the class, it’s taken literally.

09:07 So this expression says to match the number sign (#), the colon (:), or the caret (^). Part of Wile E. Coyote’s cartoon utterance of a swear word is being matched. Notice, this is actually two matches—the '#' and the '^' separately.

09:24 Other colons in the text are also being matched.
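A sketch of the literal caret, assuming the pattern is [#:^]. The sample string, including the cartoon-style symbols, is made up for illustration.

```python
import re

# Hypothetical sample with a colon and some cartoon-style symbols.
text = "Re: @#^%!"

# The caret is not first in the class, so it is an ordinary character here.
symbols = re.findall(r"[#:^]", text)
print(symbols)  # [':', '#', '^']
```

Each symbol is a separate one-character match, which matches the narration's point that '#' and '^' are matched separately.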

09:28 To match square brackets, you include those inside of the square brackets. This regex is three pieces. The first part is the character left square bracket, the second part is a capital letter, and the third part is the right square bracket. Combined, these match '[X]' inside of the model number.

09:52 Alternatively, you can use the backslash to escape the special character. Backslash left bracket (\[) and backslash right bracket (\]) tell the regular expression to actually look for left and right bracket characters, matching the '[X]' in the model number below.
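A sketch of the backslash form in Python, with a hypothetical model number as the sample. Only the escaped form is shown here because, as far as I understand the re documentation, recent Python versions may emit a FutureWarning for a character set that begins with a literal left bracket.

```python
import re

# Hypothetical model number, not the one from the lesson text.
text = "Model RB-[X]-9"

# \[ and \] are literal brackets; [A-Z] in the middle is a normal character class.
bracketed = re.findall(r"\[[A-Z]\]", text)
print(bracketed)  # ['[X]']
```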

10:09 You have to be very careful with this. Escape sequences are used in Python strings to escape other kinds of letters, so when you have a string containing a regular expression, you can end up with escape sequences of escape sequences.

10:22 This can be a little messy. I’ll show you how to deal with this later.
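As a brief illustration of the problem, here is the same pattern written as a normal string and as a raw string. This is only a sketch of standard Python string behavior, not necessarily the approach taken later in the course.

```python
# In a normal string literal, each backslash meant for the regex must be doubled.
doubled = "\\[[A-Z]\\]"

# A raw string (r"...") leaves backslashes alone, so the regex reads as written.
raw = r"\[[A-Z]\]"

same = doubled == raw
print(same)  # True
```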

10:27 One last example for you inside of character sets. You’ve seen this before—it’s matching the capital letter A or the capital letter C.

10:35 There’s another way of expressing this concept. The pipe operator acts as OR, so this expression is equivalent to the previous one, using either the letter A or the letter C. With a short expression like this one, it doesn’t make much difference which mechanism you use. When you start to build more complicated expressions and group things together, the OR symbol allows you to do things that square bracket character sets do not. So, those were your first regular expressions.
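The equivalence can be checked with a short sketch, assuming the two patterns are [AC] and A|C. The sample string is hypothetical.

```python
import re

# Hypothetical sample text.
text = "ACME CORP"

# A character class and an alternation that describe the same single-letter choice.
with_class = re.findall(r"[AC]", text)
with_pipe = re.findall(r"A|C", text)

print(with_class)  # ['A', 'C', 'C']
print(with_pipe)   # ['A', 'C', 'C']
```

Unlike a character class, the alternatives on either side of the pipe can be longer than one character.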

11:04 Hopefully it wasn’t too painful. In the next lesson, I’ll be talking about meta-characters: how to express things like whitespace inside of a regular expression.

born2build on Nov. 20, 2020

Great course! I thought I was starting to get regex until I tried this myself and got stumped. How do I capture the digits in between the ‘<…>’ in this pattern? “<some 456xr:-z9987 thing>123456”

Christopher Trudeau RP Team on Nov. 21, 2020

Hi born2build,

Don’t feel bad. Even having written the course, I have to fiddle with it before I get it. I wrote a short program the other day to look for mismatched HTML tags, and it took me a bit of playing in pythex.org before I got it right.

I believe the following will capture what you’re looking for:

r'<.*?(\d+).*?(\d+).*?>'

The regex will match everything between the < and >, with two groups, the groups containing the digits. Notice the use of "*?", the non-greedy wild card. The non-greedy consumption here stops the inclusion of the digits in the ".*" portion – i.e. it makes sure not to eat the digits.

Note a couple of things: 1) this regex will ignore everything after the closing “>” and 2) it will not work as intended if you don’t have two groups of digits between the angle brackets. If you apply the same regex to “<some 456 thing>123456”, your match groups will contain “45” and “6”, because the regex is explicitly built to capture two independent groups of numbers between the angle brackets.
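For anyone who wants to try this outside of pythex, the regex above can be run against both strings from this thread with the re module:

```python
import re

pattern = r'<.*?(\d+).*?(\d+).*?>'

# The string from the original question: two runs of digits inside the brackets.
first = re.search(pattern, "<some 456xr:-z9987 thing>123456")
print(first.group(0))  # <some 456xr:-z9987 thing>
print(first.groups())  # ('456', '9987')

# Only one run of digits inside the brackets: the regex splits it to fill both groups.
second = re.search(pattern, "<some 456 thing>123456")
print(second.groups())  # ('45', '6')
```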

born2build on Nov. 23, 2020

Awesome! Just one curious question. Pythex.org is a great tool, along with other forward regex tools, but I wonder if there is a reverse regex tool where I can provide the pattern and the match and it will give me the regex? If not, is it possible to build something like that? Or am I just dreaming in regex land?

Christopher Trudeau RP Team on Nov. 24, 2020

Hi born2build,

I’m not aware of any such thing, but the internet is a big place, maybe somebody has built it. I suspect any such tool would be rather limited in scope for certain kinds of patterns. The complexity you can achieve with a regex would make reverse engineering from the result rather difficult.

That being said, there are libraries out there that are easier to use than regex for certain kinds of pattern matching. I haven’t used it myself, but one of the other RP authors, Geir Arne, is a big fan of “parse”. He talks about it in the following article:

realpython.com/python-packages/#parse-for-matching-strings
