Accessing Groups

Regular Expressions and Building Regexes in Python Christopher Trudeau 03:42

00:00 In the previous lesson, I showed you the re module of Python, the module that exposes the power of regular expressions inside of Python code.

00:09 In this lesson, I’ll show you how to get at those groups defined in your regex. You’ve already seen some of the attributes and methods on the re.Match object. There are other ones, though, around grouping. The .group() method takes one or more group numbers as arguments.

00:26 It returns a tuple of matched groups for each of the numbers you give it. So if you call .group(1, 3), it’ll return the first and third groups in a tuple. .groups() plural returns all of the matched groups.
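As a rough sketch of the difference between .group() and .groups(), the snippet below uses a hypothetical three-group pattern and a made-up sample string (neither appears in the lesson) so that .group(1, 3) has a third group to return.

```python
import re

# Hypothetical sample: three comma-separated words, one capture group each.
match = re.search(r"(\w+),(\w+),(\w+)", "red,green,blue")

# .group() with several numbers returns just those groups, as a tuple.
first_and_third = match.group(1, 3)

# .groups() returns every captured group, in order.
all_groups = match.groups()

print(first_and_third)  # ('red', 'blue')
print(all_groups)       # ('red', 'green', 'blue')
```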

00:42 You’ve seen the .start() and .end() methods showing the beginning and end of matches in the string. The .expand() method takes a string template and substitutes any backreferences inside of the template with their actual results.

00:57 This is kind of like the idea of an f-string, but with the backreferences from your regex. Finally, there’s also the .span() method, which you’ve seen before as well, which contains the start and end values in a tuple.

01:12 Starting out by importing the library. Let’s do a search with a group match.

01:21 This regex is looking for a group consisting of one or more word meta-characters followed by a comma, followed by another group with one or more word characters.

01:33 The .groups() function shows the matching groups. The first group is the word 'one', and the second group is the word 'two'.
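The on-screen code is not part of the transcript, so the following is only a sketch of what this step might look like. The pattern is the one quoted in the discussion below; the sample string "one,two,three" is an assumption inferred from the groups mentioned.

```python
import re

# Assumed sample string; the transcript does not show the original.
match = re.search(r"(\w+),(\w+)", "one,two,three")

# Both capture groups, in the order they appear in the pattern.
groups = match.groups()
print(groups)  # ('one', 'two')
```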

01:46 You can see an individual group by passing a number to the .group() function.

01:53 You can also pass in multiple.

01:58 If you pass in more than one, you get back a tuple in the order that you put it in. You may notice that these are all one-indexed. This is to be consistent with the concept of a backreference.
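A small sketch of single and multiple group lookups, again assuming the sample string "one,two,three". Passing the numbers in reverse shows that the tuple follows the order of the arguments rather than the order of the groups.

```python
import re

# Assumed sample string; the transcript does not show the original.
match = re.search(r"(\w+),(\w+)", "one,two,three")

single = match.group(1)       # one group number returns a string
reversed_pair = match.group(2, 1)  # several numbers return a tuple

print(single)         # one
print(reversed_pair)  # ('two', 'one')
```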

02:11 .group(0) is the entire match. Because this regular expression is looking for some number of word characters, a comma, and then some more word characters, the 'one' and the 'two' match, but the 'three' does not, because the pattern covers only one comma-separated pair and .search() returns a single match.

02:26 You can also use indexes on the Match object to get the same content. Once again, index 1 corresponds to the backreference \1, and index 0 is the entire match, just like match.group(0).
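The equivalence between .group() and indexing can be sketched as follows, with the sample string once more being an assumption. Indexing a Match object requires Python 3.6 or later.

```python
import re

# Assumed sample string; the transcript does not show the original.
match = re.search(r"(\w+),(\w+)", "one,two,three")

whole_via_group = match.group(0)  # entire matched text
whole_via_index = match[0]        # same thing using indexing
first_via_index = match[1]        # same as match.group(1)

print(whole_via_group)  # one,two
print(whole_via_index)  # one,two
print(first_via_index)  # one
```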

02:41 .expand() takes a template string and expands the backreferences in it. It takes the results of the match and inserts the groups that you reference inside of the template into the string.
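A minimal sketch of .expand(), assuming the same sample string as before. The template text is invented for illustration; both the \1 form and the \g<1> form of a backreference are accepted in templates.

```python
import re

# Assumed sample string; the transcript does not show the original.
match = re.search(r"(\w+),(\w+)", "one,two,three")

# Numeric backreferences are replaced with the matching groups.
swapped = match.expand(r"\2 comes after \1")

# The \g<number> form is an equivalent, less ambiguous spelling.
dashed = match.expand(r"\g<1>-\g<2>")

print(swapped)  # two comes after one
print(dashed)   # one-two
```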

02:56 This is kind of like an f-string for your regular expression matches. You’ve seen the .start(), .end(), and .span() functions before, but for completeness’s sake, here they are again.

03:11 There’s a plain text .search(). There’s the Match. Starting value of 4, ending value of 7, and the span of 4 to 7. Just as a reminder, these are zero-indexed, the same as a slice in a list or a string.
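The positions mentioned here can be reproduced with a plain-text search for 'two' in an assumed sample string of "one,two,three". The exact on-screen code is not in the transcript, so treat this as an illustration only.

```python
import re

text = "one,two,three"  # assumed sample string
match = re.search("two", text)

start = match.start()
end = match.end()
span = match.span()

print(start, end)  # 4 7
print(span)        # (4, 7)

# The values behave like slice bounds: start is included, end is not.
print(text[start:end])  # two
```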

03:35 Next up, I’ll show you how to name groups so that you don’t have to just use numbers.

raulfz on Jan. 28, 2021

Thank you for this tutorial,

I would like to know which method would give as result the two matches, that is:(one,two) and (two,three), I mean recursive matching.

I haven’t read the docs yet, but is there a flag or meta-character in python to alternate between greedy and not greedy matches?

Thank you.

Bartosz Zaczyński RP Team on Jan. 29, 2021

@raulfz Apparently, the built-in re module in Python standard library doesn’t support recursive matching. You can try out a third-party module such as regex, for example.

By default, quantifiers are greedy, but you can make them lazy by appending the question mark (?) meta-character.
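As a quick illustration of greedy versus lazy quantifiers, here is a sketch with a made-up sample string. The only difference between the two patterns is the trailing question mark after the plus sign.

```python
import re

text = "<a><b>"  # hypothetical sample string

# Greedy: .+ grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group(0)

# Lazy: .+? grabs as little as possible, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", text).group(0)

print(greedy)  # <a><b>
print(lazy)    # <a>
```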

raulfz on Jan. 29, 2021

Thank you @Bartosz Zaczyński, keep it up with these excellent tutorials!

Best, R.

sheldon on Nov. 4, 2021

I don’t understand the use of the 'r' (raw string literal?) in: r"(\w+),(\w+)"

Why is it needed? I understand the regex except for the 'r' prefix.

Thank you

Bartosz Zaczyński RP Team on Nov. 5, 2021

@sheldon Python’s raw strings disable the evaluation of special character sequences denoted with a backslash, such as \", \\, or \n. In regular strings, those are replaced with non-printable characters, such as a newline, or with characters that would otherwise cause a syntax error, such as a quotation mark:

>>> print("hello\nworld")
hello
world
>>> print('My name\'s Joe.')
My name's Joe.

On the other hand, if you wanted to print such character sequences literally, then you’d need to escape them with another backslash:

>>> print("hello\\nworld")
hello\nworld
>>> print('My name\\\'s Joe.')
My name\'s Joe.

However, that can get cumbersome with regular expressions, which are composed largely of special character sequences. Raw strings let you avoid collisions between escape sequences that look the same in standard strings and regular expressions but have a different meaning. For example, the sequence \n in a regular expression is meant to match a newline character rather than insert a literal line break into the pattern.
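One place where the collision has a visible effect is \b, which is a backspace character in a regular string but a word boundary in a regular expression. The sketch below uses a made-up sample sentence to show the difference.

```python
import re

# The raw string and the doubled-backslash string are the same pattern.
same_pattern = "\\w+" == r"\w+"

sentence = "a cat sat"  # hypothetical sample text

# Regular string: "\b" becomes a backspace character, so nothing matches.
plain = re.search("\bcat\b", sentence)

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw = re.search(r"\bcat\b", sentence)

print(same_pattern)  # True
print(plain)         # None
print(raw.group(0))  # cat
```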

Christopher Trudeau RP Team on Nov. 5, 2021

Thanks for the detailed response Bartosz!

@sheldon – the details of escaping regexes, all the backslashes, and using the raw string are covered in the lesson after this one.

Cindy on March 27, 2022

Hi Bartosz,

I am wondering about why the first example applies one backslash to escape while the second applies two?

>>> print("hello\\nworld")
hello\nworld

>>> print('My name\\\'s Joe.')
My name\'s Joe.

Bartosz Zaczyński RP Team on March 27, 2022

@Cindy A backslash character affects exactly one character that appears immediately to the right. Therefore, Python will replace two consecutive backslash characters (\\) in a string literal with one backslash in the output:

>>> print("\\")
\

>>> print("hello\\nworld")
hello\nworld

This is necessary because Python treats a single backslash character as the beginning of an escape character sequence such as \n or \t. Without it, you’d get a syntax error, even in raw strings mentioned before:

>>> print("\")
  File "<stdin>", line 1
    print("\")
          ^
SyntaxError: unterminated string literal (detected at line 1)

>>> print(r"\")
  File "<stdin>", line 1
    print(r"\")
          ^
SyntaxError: unterminated string literal (detected at line 1)

In your second example, there are two separate escape character sequences:

>>> print('My name\\\'s Joe.')
My name\'s Joe.

The first one is \\ → \ and the second one is \' → '. Escaping the single quote is required in this case to avoid closing the string literal prematurely. Alternatively, you could enclose the string in double quotes to avoid escaping the single quote:

>>> print("My name\\'s Joe.")
My name\'s Joe.

Christopher Trudeau RP Team on March 27, 2022

Hi Cindy,

The first example is escaping a single backslash. Normally, “\n” is a reserved marker meant for newline. To actually print a backslash and then an “n”, you have to escape the backslash.

The second example is escaping two things: the first escape is the backslash, the second is the single quote. If I had used double quotes for that example, you wouldn’t need the escaped single quote.

Notice the difference:

>>> print("My name\\'s Joe.")
My name\'s Joe.
>>> print('My name\\\'s Joe.')
My name\'s Joe.

Same result, but the choice of what kind of quotations to use for the string changes what needs to be escaped.

Christopher Trudeau RP Team on March 27, 2022

Evidently @Bartosz and I are replying at the same time using the same examples, how’s that for being in sync :)

Cindy on March 27, 2022

@ Bartosz and Christopher, thank you very much for the detailed explanation. It is really helpful!
