
Python Regex: Replace Captured Groups

Last Updated : 03 Sep, 2024

Regular Expressions, often abbreviated as Regex, are sequences of characters that form search patterns. They are powerful tools used in programming and text processing to search, match, and manipulate strings. Think of them as advanced search filters that allow us to find specific patterns within a text, such as email addresses, dates, phone numbers, or any custom pattern we can imagine.

Regex is supported in almost all major programming languages, and in Python, the `re` module provides an extensive set of functionalities for regex operations. This article dives into one of the crucial aspects of regex in Python: Regex Groups, and demonstrates how to use `re.sub()` for group replacement with practical examples.

Understanding Regex Groups

A group in regex is a part of a pattern that is enclosed in parentheses. Groups allow us to segment a pattern into sub-patterns, making it easier to apply specific operations to each part. Capturing groups are numbered from 1, in the order of their opening parentheses. Group 0 is always the entire match.

For example, consider the regex pattern:

import re

pattern = r"(Hello) (World)"

In this pattern, there are two groups:

  • Group 1: Hello
  • Group 2: World
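As a minimal sketch of the numbering described above, the match object returned by `re.search()` exposes each group by index. The sample sentence is an assumption made for illustration.

```python
import re

pattern = r"(Hello) (World)"
match = re.search(pattern, "Say Hello World today")

whole = match.group(0)    # the entire match
first = match.group(1)    # text captured by the first group
second = match.group(2)   # text captured by the second group
both = match.groups()     # all captured groups as a tuple

print(whole, first, second, both)
```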

Groups can be used to:

  1. Extract parts of the matched string: If we want to extract or capture a specific part of the text that matches the pattern, groups allow us to do this.
  2. Apply repetitions or conditions to specific parts: Groups also allow us to apply quantifiers like `*`, `+`, `?`, or `{n,m}` to a specific part of our pattern rather than to a single character.
  3. Use backreferences: In regex, a backreference allows us to refer to a previously captured group, which is helpful in certain advanced regex patterns.
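The second and third uses can be sketched as follows. The sample strings are assumptions chosen for illustration: one pattern applies a quantifier to a whole group, the other uses a backreference inside the pattern to find an immediately repeated word.

```python
import re

# Quantifier applied to a whole group: one or more repetitions of "ab".
repeated = re.fullmatch(r"(ab)+", "ababab")
is_repeated = repeated is not None
last_repetition = repeated.group(1)  # a repeated group keeps only its last match

# Backreference inside the pattern: \1 must match the same text as group 1,
# so this finds a word that is immediately followed by itself.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_word = doubled.group(1)

print(is_repeated, last_repetition, doubled_word)
```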

Types of Groups

1. Simple Capturing Groups: These are the basic groups created by enclosing part of the regex in parentheses.

pattern = r"(cat)"

2. Non-Capturing Groups: Sometimes we want to group parts of the pattern but don't need to capture them for later use. In such cases, we can use non-capturing groups, denoted by (?:...).

pattern = r"(?:cat)"
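A small sketch of the difference, using an assumed sample string: both patterns match the same text, but the non-capturing group does not appear in `groups()` and does not take up a group number.

```python
import re

text = "cat food"

capturing = re.search(r"(cat) (food)", text)
non_capturing = re.search(r"(?:cat) (food)", text)

capturing_groups = capturing.groups()          # both parts are captured
non_capturing_groups = non_capturing.groups()  # only "food" is captured, as group 1

print(capturing_groups, non_capturing_groups)
```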

3. Named Capturing Groups: These groups can be referenced by name instead of only by numerical index, which makes the regex more readable. Named groups are defined with (?P<name>...) and still receive a number like any other capturing group.

pattern = r"(?P<animal>cat)"
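The following sketch shows how a named group might be read back from a match. The sample sentence is an assumption for illustration.

```python
import re

match = re.search(r"(?P<animal>cat)", "The cat sat on the mat.")

by_name = match.group("animal")   # access by name
by_index = match.group(1)         # the same group is still reachable by index
as_dict = match.groupdict()       # mapping of group names to captured text

print(by_name, by_index, as_dict)
```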

4. Lookahead and Lookbehind Assertions: These use group-like syntax but do not capture anything. They assert whether a certain pattern is ahead of or behind the current position without consuming characters.

# Positive lookahead
pattern = r"cat(?=dog)"

# Positive lookbehind
pattern = r"(?<=dog)cat"
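To see that the asserted text is checked but not consumed, here is a minimal sketch on an assumed sample string. The spans show that only `cat` is part of each match.

```python
import re

text = "catdog dogcat catfish"

# "cat" only when immediately followed by "dog"; "dog" is not part of the match.
ahead = [m.span() for m in re.finditer(r"cat(?=dog)", text)]

# "cat" only when immediately preceded by "dog".
behind = [m.span() for m in re.finditer(r"(?<=dog)cat", text)]

# Because the lookahead consumes nothing, only "cat" is replaced.
upper = re.sub(r"cat(?=dog)", "CAT", text)

print(ahead, behind, upper)
```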

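Let us illustrate capturing groups with an example. Here, text is a string that contains two email addresses: geek.joe@example.com and wane.smith@example.org. Because the pattern has two capturing groups, `re.findall()` returns one tuple per match.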

import re

text = "geek mail geek.joe@example.com, and wane's email is wane.smith@example.org"

pattern = r"(\w+\.\w+)@(\w+\.\w+)"

matches = re.findall(pattern, text)
print(matches)

Output
[('geek.joe', 'example.com'), ('wane.smith', 'example.org')]

Explanation:

  • (\w+\.\w+): This is a capturing group:
    • \w+: Matches one or more word characters (letters, digits, and underscores).
    • \.: Matches a literal dot (.).
    • So, (\w+\.\w+) matches a pattern like geek.joe or wane.smith.
  • @: Matches the @ symbol, which separates the local part of the email address from the domain.
  • (\w+\.\w+): The second capturing group is similar to the first and matches the domain part, like example.com or example.org.
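As a complementary sketch, `re.finditer()` yields match objects, so the full match and each group can be read individually. The second input string is an assumption added to show a limitation of this simple pattern: an address whose local part has no dot is not matched.

```python
import re

text = "geek mail geek.joe@example.com, and wane's email is wane.smith@example.org"
pattern = r"(\w+\.\w+)@(\w+\.\w+)"

# Each tuple holds the whole address, the local part, and the domain.
pairs = [(m.group(0), m.group(1), m.group(2)) for m in re.finditer(pattern, text)]

# The first group requires a dot, so "joe@example.com" produces no match.
no_dot = re.findall(pattern, "contact joe@example.com")

print(pairs)
print(no_dot)
```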

Replacing Captured Groups Using `re.sub()`

The re.sub() function is used to replace occurrences of a regex pattern in a string with a specified replacement. When using groups, we can reference these groups in the replacement string using backreferences like `\1`, `\2`, etc., corresponding to the first, second, and subsequent groups.
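Besides `\1`, the replacement string also accepts the explicit form `\g<1>`. The sketch below, on an assumed sample string, shows a case where the explicit form is needed because a literal digit follows the reference.

```python
import re

text = "item 7"

# \g<1> is the explicit form of \1. It is needed when a digit follows the
# reference: r"\10" would be read as a reference to group 10, which does
# not exist in this pattern.
padded = re.sub(r"(\d)", r"\g<1>0", text)

print(padded)
```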

Syntax of `re.sub()`

re.sub(pattern, repl, string, count=0, flags=0)
  • `pattern`: The regex pattern to search for.
  • `repl`: The replacement. It can be a string, in which backreferences insert captured groups, or a function that receives each match object and returns the replacement text.
  • `string`: The original string where the replacement is performed.
  • `count`: The maximum number of replacements. Default is `0`, which means replace all occurrences.
  • `flags`: Optional flags to modify the regex behavior.
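A brief sketch of the two optional parameters, using an assumed sample string. They are passed by keyword here, which keeps the call unambiguous.

```python
import re

text = "Cat, cat, CAT"

# count=1 stops after the first replacement (the first lowercase "cat").
first_only = re.sub(r"cat", "dog", text, count=1)

# flags=re.IGNORECASE makes the pattern match regardless of case.
any_case = re.sub(r"cat", "dog", text, flags=re.IGNORECASE)

print(first_only)
print(any_case)
```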

Python provides several ways to replace captured groups in a string:

Example 1: Using re.sub() with Group References

import re

# Replace "cat" but keep the captured verb by referencing group 1 as \1
text = "The cat sat on the mat."

pattern = r"cat (sat)"

replaced_text = re.sub(pattern, r"dog \1", text)
print(replaced_text)  # Output: The dog sat on the mat.

Output

The dog sat on the mat.

Example 2: Using re.sub() with Named Group References

Named groups can be referenced using \g<name> in the replacement string.

import re

text = "The cat sat on the mat."

# The verb is captured in a group named "verb"
pattern = r"cat (?P<verb>sat)"

# \g<verb> inserts the text captured by the named group
replaced_text = re.sub(pattern, r"dog \g<verb>", text)
print(replaced_text)  # Output: The dog sat on the mat.

Output

The dog sat on the mat.

Example 3: Replacing Multiple Groups

We need to reformat a date from `DD-MM-YYYY` format to `YYYY-MM-DD` format.

import re

date_text = "Today's date is 02-09-2024 and yesterday was 01-09-2024."

date_pattern = r"(\d{2})-(\d{2})-(\d{4})"

reformatted_text = re.sub(date_pattern, r"\3-\2-\1", date_text)

print(reformatted_text)
# Output: Today's date is 2024-09-02 and yesterday was 2024-09-01.

Output
Today's date is 2024-09-02 and yesterday was 2024-09-01.

Here, `\1` refers to the first group (the day), `\2` to the second group (the month), and `\3` to the third group (the year). By rearranging these backreferences in the replacement string, we reorder the parts of each date.
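The same reformatting could also be written with named groups, or with a function passed as the replacement. The sketch below is one possible variant, not the only way to do it: the group names `day`, `month`, and `year` and the helper `to_iso` are hypothetical names introduced for illustration.

```python
import re

date_text = "Today's date is 02-09-2024 and yesterday was 01-09-2024."

# Same pattern as above, with assumed group names for readability.
date_pattern = r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})"

# Variant 1: named backreferences in the replacement string.
by_name = re.sub(date_pattern, r"\g<year>-\g<month>-\g<day>", date_text)

# Variant 2: a function that receives each match object and returns
# the replacement text (hypothetical helper).
def to_iso(match):
    return f"{match.group('year')}-{match.group('month')}-{match.group('day')}"

by_function = re.sub(date_pattern, to_iso, date_text)

print(by_name)
print(by_function)
```

A function is mainly useful when the replacement needs logic that a template string cannot express, such as converting case or doing arithmetic on a captured number.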

Conclusion

Regular expressions are powerful, but they can be daunting at first. Understanding how groups work and using functions like `re.sub()` effectively can significantly enhance our string manipulation capabilities. With practice, we'll find regex an indispensable tool in our programming toolkit.

