Advanced Matching

Christopher Trudeau

Regular Expressions and Building Regexes in Python Christopher Trudeau 07:17


00:00 In the previous lesson, I showed you how to use flags to modify your regular expression behavior. In this lesson, I’m going to show you some even more advanced regular expression patterns.

00:11 I’m going to cover four concepts using pythex: conditional matches, lookahead matches, lookbehind matches, and comments. Conditional matches change the matching behavior based on whether or not something is present.

00:25 Lookahead and lookbehind matches change the behavior of the regex based on content ahead or behind what you’re looking for, but don’t include that part in the match.

00:37 Comments are, well, comments. We’ll start with a familiar plain text match inside of a group, looking for the word ACME. I can add a quantifier to say this can occur once or not at all. For the purposes of this expression, nothing’s changed.

00:59 I’ve added a little more to it. This time it’s looking for ACME once or not at all, a whitespace (\s), the word Super, and then the period meta-character (.) zero or more times.

01:12 This essentially captures everything in a paragraph after the word ACME. Now I’m going to introduce the concept of a conditional.
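As a rough sketch of the pattern described so far, the snippet below runs it with Python's re module. The sample text is an assumption made for illustration; the text used in the video is not part of the transcript.

```python
import re

# Hypothetical sample text, not the text shown in the lesson.
text = "ACME Super Outfit for coyotes\nThe Super Outfit for roadrunners"

# (ACME)? : the word ACME, once or not at all
# \s      : one whitespace character
# Super   : the literal word Super
# .*      : any character except a newline, zero or more times
pattern = re.compile(r"(ACME)?\sSuper.*")

matches = [m.group(0) for m in pattern.finditer(text)]
groups = [m.group(1) for m in pattern.finditer(text)]

print(matches)  # ['ACME Super Outfit for coyotes', ' Super Outfit for roadrunners']
print(groups)   # ['ACME', None]
```

Group 1 is None in the second match because the optional ACME group did not take part in it.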

01:22 The part that’s been added is inside of this group. The question mark followed by parentheses, ?(1), says that this is a conditional group. The 1 is a backreference, so this conditional depends on the presence of group 1.

01:38 So, if ACME, the first group, is present, then this evaluates to True. If ACME is not present, then this evaluates to False.

01:51 The regular expression that gets run in the case of True is everything to the right of the backreference group and to the left of the OR symbol, the pipe.

02:01 This is the word Out in a group. In the case where ACME isn’t found, the other pattern is run. And in this case, it’s \w* followed by fit in a group.

02:14 You can see the two different places where this is happening. The first match has 'ACME', then 'Super', and because 'ACME' is present, it’s including the 'Out' part of 'Outfit', but only the 'Out' part.

02:30 In the second case, where 'ACME' isn’t present, it’s including \w*, which in this case is also 'Out', and the 'fit' part.

02:40 You can also see how this works out inside of the groups. For the first match, 'ACME' is found and is set as the first group. Because group 1 is present, the True part of the conditional is run, which is the second group, and in this case it matches 'Out'. The third group, the 'fit', does not get evaluated because the conditional passed. Match 2 has the opposite. The first group is empty, the second group is empty because it’s part of the True portion of the conditional, and the third group is run because it’s part of the False portion of the conditional. Because group 1, ACME, wasn’t found, the second part of the conditional is run and 'fit' is matched.
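The conditional walkthrough above can be approximated in Python as follows. The pattern is reconstructed from the description (ACME in group 1, Out in group 2, fit in group 3), and the sample text is hypothetical, so treat this as a sketch rather than the exact expression from the video.

```python
import re

# Hypothetical sample text with one line containing ACME and one without it.
text = "ACME Super Outfit\nThe Super Outfit"

# (?(1)A|B) : if group 1 took part in the match, try A, otherwise try B.
pattern = re.compile(r"(ACME)?\sSuper\s(?(1)(Out)|\w*(fit))")

results = [(m.group(0), m.groups()) for m in pattern.finditer(text)]

for matched_text, captured in results:
    print(repr(matched_text), captured)
# 'ACME Super Out' ('ACME', 'Out', None)
# ' Super Outfit' (None, None, 'fit')
```

In the second match, \w* first takes all of Outfit and then backtracks so that the (fit) group can still match.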

03:27 Now I’m going to show you a lookahead.

03:31 I’m going to start out with a regular expression without a lookahead in it. I’m looking for the word writing, some whitespace, and then the literal letter t inside of a group.

03:44 I change the group to a lookahead using the ?= operator. The regular expression is still looking for writing, whitespace, and then a t, but the (?=t) portion is not consumed.

04:01 No group is created, and more importantly, the t isn’t considered part of the match. If you had more matching criteria later on, the t would participate in that matching criteria.

04:15 The reason this is called a lookahead is that after finding writing, the match looks ahead to see if the next character is a t. If it finds one, the pattern matches, but the t is not used up as part of the match.
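To make the difference between the capturing group and the lookahead concrete, here is a small sketch. The sample sentence is an assumption chosen for illustration.

```python
import re

# Hypothetical sample sentence.
text = "writing tests and writing code"

# Capturing group: the t is consumed and stored as group 1.
plain = re.search(r"writing\s(t)", text)

# Lookahead: the t must be there, but it is neither consumed nor captured.
look = re.search(r"writing\s(?=t)", text)

# Because the t was not consumed, a later part of the pattern can still match it.
follow = re.search(r"writing\s(?=t)\w+", text)

print(repr(plain.group(0)), plain.groups())  # 'writing t' ('t',)
print(repr(look.group(0)), look.groups())    # 'writing ' ()
print(repr(follow.group(0)))                 # 'writing tests'
```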

04:32 Here’s another example. This is looking for 4 digits followed, in a lookahead group, by the literal [, a \w, and the literal ].

04:46 This matches the model numbers below. Because it’s a lookahead, only the '3990' actually participates in the match. But you’ll notice, because the lookahead is there, none of the other four-digit numbers is matched. The lookahead looks for the [<letter>] form, limiting this to just the digits inside of the model number.

05:11 Changing the equal sign (=) to an exclamation mark (!) changes it to a negative lookahead, meaning “Only match situations where the digits are not followed by [<letter>].” This matches all the four-digit numbers that aren’t associated with the model number.
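The positive and negative lookahead on model numbers might look like this in Python. The sample text and its numbers, other than 3990, are made up for the sketch.

```python
import re

# Hypothetical text: one model number in the 3990[A] form plus other four-digit numbers.
text = "Model 3990[A] shipped in 2021; 1500 units sold by 2023."

# Positive lookahead: four digits only when followed by [<letter>].
model_digits = re.findall(r"\d{4}(?=\[\w\])", text)

# Negative lookahead: four digits only when NOT followed by [<letter>].
other_digits = re.findall(r"\d{4}(?!\[\w\])", text)

print(model_digits)  # ['3990']
print(other_digits)  # ['2021', '1500', '2023']
```

In both cases the bracketed letter is checked but never becomes part of the matched text.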

05:32 ?<= is lookbehind. This is a similar concept to lookahead, but it happens before the pattern that you’re looking for. So in this situation, I’m looking to match the literal [, \w, literal ], preceded by 4 digits.

05:52 Notice that I was explicit about how many digits were here. Due to the way regular expressions are implemented in Python’s re module, lookbehinds have to be of a fixed length.

06:01 If I change the \d{4} to be \d+ it will fail.

06:10 Regular expressions are built on something called finite-state machines. Finite-state machines only allow certain kinds of computing patterns, and this is not one of them. In the reference material in a later lesson, I’ll show you where you can dig out more information on how this works.
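A short sketch of the fixed-length rule as it shows up in Python's built-in re module follows. The sample text is hypothetical, and the error text is printed rather than assumed.

```python
import re

# Hypothetical sample text.
text = "Model 3990[A] and part 12[B]"

# Fixed-width lookbehind: [<letter>] only when preceded by exactly four digits.
fixed = re.findall(r"(?<=\d{4})\[\w\]", text)

# Variable-width lookbehind: the re module rejects this when compiling the pattern.
try:
    re.compile(r"(?<=\d+)\[\w\]")
    variable_width_error = None
except re.error as exc:
    variable_width_error = str(exc)

print(fixed)                 # ['[A]']
print(variable_width_error)  # the message reported by re
```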

06:29 You can also negate lookbehind. ?<! is a negative lookbehind. This is looking for two digits, not preceded by two digits. And finally, every good programmer knows to put comments in their code.
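Before the comment syntax, here is a minimal sketch of the negative lookbehind just described: two digits that are not preceded by two digits. The sample text is an assumption.

```python
import re

# Hypothetical sample text.
text = "3990 and 42"

# (?<!\d{2}) : the two characters before the current position must not both be digits.
pairs = re.findall(r"(?<!\d{2})\d{2}", text)

print(pairs)  # ['39', '42']
```

The 90 in 3990 is skipped because it is preceded by the two digits 39.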

06:48 You can put comments inside of your regular expression. ?# is the comment symbol. Anything inside of the group is ignored. This is part of the regular expression standard. In Python, where you have the VERBOSE flag, I would much rather use that.

07:04 It’s far clearer than trying to insert this inside of your regex.
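To compare the two commenting styles, the sketch below writes the same pattern once with inline (?#...) comment groups and once with the VERBOSE flag. The pattern and sample text are illustrative assumptions borrowed from the model-number example.

```python
import re

# Hypothetical sample text.
text = "Model 3990[A] from 2021"

# Inline comment groups: everything inside (?#...) is ignored by the engine.
inline = re.compile(r"\d{4}(?#four digits)(?=\[\w\])(?#then a bracketed letter)")

# VERBOSE flag: whitespace is ignored and # starts a comment that runs to end of line.
verbose = re.compile(
    r"""
    \d{4}        # four digits
    (?=\[\w\])   # followed by [<letter>], checked but not consumed
    """,
    re.VERBOSE,
)

inline_result = inline.findall(text)
verbose_result = verbose.findall(text)

print(inline_result)   # ['3990']
print(verbose_result)  # ['3990']
```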

07:10 Next up, I’ll show you some fun regular expressions. And by fun, I mean horrific.

Roy Telles on March 17, 2021

The video uses a ? quantifier but says “searches for ACME zero or more times.” This quantifier matches zero or one time, so I think an edit may be needed.

Christopher Trudeau RP Team on March 17, 2021

Thanks Roy. You’re right. We’ll get a patch in on that.

Rahul Pandey on Aug. 23, 2023

Hey Christopher, what if I want to use a variable to check for a match like for this string:

re.search(r"var1(?:¦|:|;|-)[ ]{0,1}(.*?)\\n\\x0c", texts).group(1)

Christopher Trudeau RP Team on Aug. 24, 2023

Hi Rahul,

The regex itself being passed into re.search() or any other regex method is also text. You can build the text the way you would with any other variables, either adding strings or using an f-string.

One caution, though: remember that I’m using raw strings r"..." to avoid all the escaping, as backslashes are important to a regex. Depending on how you build your string, it may not be raw. You may also have to escape the contents of the variable itself.

So, something like this:

>>> import re
>>> text = "cat dog cat3 dog2"
>>> look = "cat"
>>> my_regex = re.escape(look) + r'([0-9])'
>>> re.search(my_regex, text)
<re.Match object; span=(8, 12), match='cat3'>
>>> re.search(my_regex, text).group(1)
'3'
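The f-string route mentioned in the reply could look something like the sketch below. It reuses the same text and look variables; the two_digits pattern is a hypothetical addition to show how literal braces behave inside an f-string.

```python
import re

text = "cat dog cat3 dog2"
look = "cat"

# An rf"..." string is both raw and formatted; re.escape() protects any
# regex metacharacters that the variable might contain.
my_regex = rf"{re.escape(look)}([0-9])"

# Literal braces, such as those in a quantifier, must be doubled in an f-string.
two_digits = rf"{re.escape(look)}([0-9]{{1,2}})"

digit = re.search(my_regex, text).group(1)

print(my_regex)    # cat([0-9])
print(two_digits)  # cat([0-9]{1,2})
print(digit)       # 3
```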
