Transforming Text with re.sub

Introduction

Welcome to the final lesson of Extracting Data with Capture Groups in Python! You've come a long way through this course, building skills that let you capture structured data with named groups, enforce consistency with backreferences, and extract practical information like emails and prices from messy text. In each lesson, you've focused on finding and extracting data, which is powerful, but there's another equally important skill: transforming the data you find into a different format.

In this lesson, we'll learn to use re.sub(), Python's regular expression substitution function, to search for patterns and replace them with new text. This is far more powerful than simple string replacement because we can transform what we find rather than just replacing it wholesale. For example, we might need to standardize phone numbers from various formats into a single consistent format or redact sensitive information while preserving parts of it for context. These transformations require us to understand what we captured and intelligently modify it.

We'll start by exploring how re.sub() works at a basic level, then introduce numbered backreferences that let us reuse captured groups in our replacement strings. You'll see how to normalize messy phone numbers into a standard format by rearranging their captured components. Next, we'll tackle more complex scenarios where simple replacements aren't enough: you'll learn to use callback functions that execute custom logic on each match. By the end of this lesson, you'll be able to transform text patterns in sophisticated ways, completing your toolkit for both extracting and modifying data with regular expressions.

Understanding Text Transformation

Before diving into code, let's understand what makes re.sub() fundamentally different from Python's standard str.replace() method. With str.replace(), you specify an exact string to find and an exact string to replace it with. Every occurrence of "hello" becomes "goodbye," for instance. This works well for fixed text, but it falls apart when data varies: not all phone numbers look the same, and not all prices follow identical formatting.

The re.sub() function solves this by accepting a pattern rather than a fixed string. You define what to look for using all the regex tools you've learned: character classes, quantifiers, capture groups, and anchors. The function finds every match of your pattern, then replaces each match with new text. The replacement can be a simple string, but here's where it gets interesting: you can reference the captured groups from your pattern, letting you reorganize, reformat, or selectively modify the matched content.

This capability transforms re.sub() from a simple find-and-replace tool into a powerful data transformation engine. Instead of just changing text, you're restructuring it. A phone number like "(415) 555-2671" contains all the information needed to create "+1-415-555-2671," but those pieces need to be rearranged and reformatted. Similarly, alice.smith@example.com can become a***@example.com by keeping the first character, hiding the rest, and preserving the domain. These aren't simple replacements; they're intelligent transformations based on what was captured.
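
To make the contrast concrete, here is a minimal sketch that runs both approaches side by side. The sample strings are made up for illustration; str.replace() and re.sub() are the standard calls discussed above.

```python
import re

# Fixed-text replacement: only the exact substring "hello" is affected.
greeting = "hello world, hello again"
fixed = greeting.replace("hello", "goodbye")
print(fixed)  # goodbye world, goodbye again

# Pattern-based replacement: every run of digits matches, whatever its value.
# (The sample sentence is an illustrative assumption.)
prices = "Items cost 5, 20 and 125 dollars"
masked = re.sub(r'\d+', '#', prices)
print(masked)  # Items cost #, # and # dollars
```

The second call could not be written as a single str.replace() because the three numbers share a shape, not a value.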

Basic Replacement with re.sub

Let's start with the simplest form of re.sub() to understand its syntax before adding complexity. The function takes three main arguments: the pattern to search for, the replacement text, and the string to process.

    import re

    # Replace all sequences of whitespace with a single space
    text = "Hello    world\t\tfrom   Python"
    result = re.sub(r'\s+', ' ', text)
    print(result)

Here, the pattern r'\s+' matches one or more whitespace characters: spaces, tabs, newlines, anything classified as whitespace. The replacement string is a single space ' '. The re.sub() function scans through text, finds every match of the pattern, and replaces each match with the replacement string. The effect is normalizing all whitespace to single spaces.

    Hello world from Python

Notice that every run of whitespace in the original text, whether repeated spaces or tabs, became a single space in the result. This demonstrates the pattern-based nature of re.sub(): we didn't need to know the exact whitespace characters to replace them. The pattern matched them all, and the replacement was applied uniformly. Now let's see how to incorporate captured data into our replacements.
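
Beyond the three main arguments, re.sub() also accepts an optional count, and the re module offers re.subn() for when you want to know how many replacements happened. Neither is needed for the rest of this lesson; the sketch below, with a made-up sample string, simply shows how they behave.

```python
import re

text = "a   b   c   d"

# count=1 stops after the first replacement; the default (count=0) replaces all.
first_only = re.sub(r'\s+', ' ', text, count=1)
print(first_only)  # a b   c   d

# re.subn returns the new string together with the number of replacements made.
collapsed, n = re.subn(r'\s+', ' ', text)
print(collapsed, n)  # a b c d 3
```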

Numbered Backreferences in Replacement Strings

The real power of re.sub() emerges when we use numbered backreferences in the replacement string. You've seen backreferences before in patterns themselves, where \1 referred back to the first captured group within the same pattern. In replacement strings, numbered backreferences work similarly but serve a different purpose: they insert the captured content into the new text.

    # Swap two words: "first second" becomes "second, first"
    pattern = r'(\w+) (\w+)'
    replacement = r'\2, \1'

The pattern (\w+) (\w+) captures two separate words. The first word goes into group 1, and the second into group 2. In the replacement string r'\2, \1', we reference these groups in reverse order: \2 inserts the content of the second group (the second word), then we add a comma and space literally, then \1 inserts the content of the first group (the first word). The backslashes before the numbers are essential: they tell the re module these are backreferences, not literal text. The r prefix matters too, because in a regular (non-raw) string Python would treat \1 as an escape sequence before re.sub() ever sees it.

This technique lets us reorganize data: take pieces from their original positions and reassemble them in a new structure. We can add literal text around the captured groups, reorder them, repeat them, or use only some of them. The replacement string becomes a template where \1, \2, \3, and so on act as placeholders for captured content. Let's apply this to a practical problem.
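
As a runnable sketch, the pattern and replacement from this section can be passed to re.sub() like this. The sample name is an assumption chosen for illustration. The second half shows the \g<1> spelling, which the re module accepts as an equivalent of \1 and which helps when a literal digit must follow the reference.

```python
import re

pattern = r'(\w+) (\w+)'
replacement = r'\2, \1'

# Sample name is an illustrative assumption.
swapped = re.sub(pattern, replacement, "Ada Lovelace")
print(swapped)  # Lovelace, Ada

# \g<1> means the same as \1. It is needed when a digit follows the
# reference, because \10 would be read as "group 10".
padded = re.sub(r'(\d)', r'\g<1>0', "a1 b2")
print(padded)  # a10 b20
```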

Normalizing Phone Numbers with Backreferences

Now let's tackle a common data cleaning task: standardizing phone numbers. Phone numbers appear in many formats: "(415) 555-2671" with parentheses, "415-555-8899" with hyphens, and "2125550000" with no separators. We want to convert all of these into a consistent format like "+1-415-555-2671", where the country code, area code, prefix, and line number are clearly separated.

    import re

    def normalize_phone_numbers(text):
        # Pattern captures area code, prefix, and line number
        pattern = r'\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})'
        # Replacement uses captured groups in standard format
        return re.sub(pattern, r'+1-\1-\2-\3', text)

The pattern r'\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})' is carefully designed to match different formats. Let's break it down: \(? matches an optional opening parenthesis, (\d{3}) captures exactly three digits (the area code) as group 1, \)? matches an optional closing parenthesis, and [\s-]? matches an optional whitespace character or hyphen. This sequence handles area codes written as "(415)", "415", or "415-" equally well. The pattern then applies similar logic to the three-digit prefix (group 2), followed by another optional separator, and the four-digit line number (group 3).

The replacement string r'+1-\1-\2-\3' constructs the standardized format. It starts with the literal text "+1-" for the country code, then inserts the captured area code with \1, adds a literal hyphen, inserts the prefix with \2, adds another hyphen, and finishes with the line number from \3. Every phone number, regardless of its original format, gets transformed into this consistent structure. The captured groups preserve the actual digits while the replacement string provides the new formatting.
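
Before substituting anything, it can help to confirm what each group actually captures. The sketch below applies the same pattern to the three formats mentioned above and prints the groups; it uses re.match() purely for inspection, which is an illustrative choice and not part of the normalization function.

```python
import re

pattern = r'\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})'

# Each sample is matched from its start; groups() returns the three captures.
samples = ["(415) 555-2671", "415-555-8899", "2125550000"]
captured = [re.match(pattern, s).groups() for s in samples]

for s, groups in zip(samples, captured):
    print(f"{s!r} -> {groups}")
```

All three samples should yield a tuple of area code, prefix, and line number, which is exactly what the \1, \2, and \3 placeholders insert.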

Testing Phone Number Normalization

Let's test our normalization function with a variety of phone number formats to verify it handles them all correctly:

    phones_text = "Call (415) 555-2671 or 415-555-8899, alt 2125550000."
    print(normalize_phone_numbers(phones_text))

The test string contains three phone numbers in different formats: "(415) 555-2671" with parentheses and spaces, "415-555-8899" with hyphens, and "2125550000" with no separators at all. Our pattern must recognize all three as valid phone numbers and transform them identically.

    Call +1-415-555-2671 or +1-415-555-8899, alt +1-212-555-0000.

Perfect! All three numbers have been normalized to the "+1-AAA-BBB-CCCC" format. The parentheses, spaces, and hyphens from the original text are gone, replaced by a consistent structure. Even the ten-digit string "2125550000" was correctly parsed into area code "212," prefix "555," and line number "0000." The surrounding text ("Call," "or," "alt") remained unchanged because it didn't match the pattern. This demonstrates how re.sub() with backreferences transforms only the matched portions while leaving everything else intact.
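
One caveat worth testing: the pattern has no boundaries, so it will also match the first ten digits of a longer number such as an order ID. The sketch below shows that behavior and one possible guard. The guard is an assumption added for illustration, not part of the lesson's function: normalize_phone_numbers_strict is a hypothetical name, and it relies on lookaround assertions to reject matches that touch other digits.

```python
import re

LOOSE = r'\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})'
# Hypothetical stricter variant: refuse matches adjacent to another digit.
STRICT = r'(?<!\d)' + LOOSE + r'(?!\d)'

def normalize_phone_numbers_strict(text):
    return re.sub(STRICT, r'+1-\1-\2-\3', text)

order = "Order 12345678901234, call 415-555-8899."
loose_result = re.sub(LOOSE, r'+1-\1-\2-\3', order)
strict_result = normalize_phone_numbers_strict(order)
print(loose_result)   # Order +1-123-456-78901234, call +1-415-555-8899.
print(strict_result)  # Order 12345678901234, call +1-415-555-8899.
```

Whether the stricter form is appropriate depends on the data; it is shown here only to illustrate how test cases can expose over-matching.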

Callback Functions for Complex Logic

Numbered backreferences are powerful, but they have limitations: you can only rearrange and insert captured text, adding literal strings around it. What if you need to perform calculations on the captured data, apply conditional logic, or use standard Python functions? For these scenarios, re.sub() accepts a callback function instead of a replacement string.

    # Instead of a string, pass a function to re.sub
    # (some_function and text are placeholders here)
    pattern = r'(\d+)'
    result = re.sub(pattern, some_function, text)

When you provide a callback function, re.sub() calls that function once for each match it finds. The function receives a match object as its argument (just like what re.search() returns), and it must return a string that will replace the match. Inside your callback, you can access captured groups with m.group(1), m.group(2), and so on. You can also use any Python code: string methods, arithmetic, conditionals, and external function calls.

This approach is particularly useful when the transformation depends on the captured content. For instance, you might want to redact email addresses but preserve the domain and the first character of the username for context. A fixed mask can still be written as a replacement template, but as soon as the redaction has to be computed from the match (say, one asterisk per hidden character, or different rules for different domains), a template is no longer enough. A callback function gives you the flexibility to implement this logic in Python, then return the transformed string. Let's see this in action.
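
As a minimal sketch of the mechanism, the callback below doubles every number it finds, something a replacement template cannot do because it involves arithmetic. The function name double_number and the sample sentence are illustrative assumptions.

```python
import re

def double_number(m):
    # m is the match object for one run of digits; the return value
    # becomes the replacement text for that match.
    return str(int(m.group(1)) * 2)

recipe = "Mix 3 eggs with 10 grams of salt"
doubled = re.sub(r'(\d+)', double_number, recipe)
print(doubled)  # Mix 6 eggs with 20 grams of salt
```

Note that the callback must return a string, which is why the product is wrapped in str().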

Redacting Emails with a Callback

Now let's implement a more complex transformation: redacting email addresses to protect privacy while keeping enough information for context. We want alice.smith@example.com to become a***@example.com: just the first character of the username is visible, the rest is replaced with asterisks, and the domain remains unchanged.

    import re

    def redact_emails(text):
        # Capture the first character of the username and the full domain;
        # the rest of the username is matched but not captured
        pattern = r'([\w\.-])[\w\.-]*@([\w\.-]+\.\w+)'

        # Define callback function that constructs redacted email
        def repl(m):
            return f"{m.group(1)}***@{m.group(2)}"

        return re.sub(pattern, repl, text)

The pattern r'([\w\.-])[\w\.-]*@([\w\.-]+\.\w+)' has three parts. The first part ([\w\.-]) captures exactly one character from the username: a word character, dot, or hyphen. Then [\w\.-]* matches the rest of the username (zero or more characters) without capturing it. This distinction is crucial: we capture the first character because we need it, but we don't capture the rest because we're going to replace it with asterisks anyway. After the @ symbol, ([\w\.-]+\.\w+) captures the entire domain as group 2.

The callback function repl takes the match object m and constructs the redacted email. It accesses the first character with m.group(1), adds three asterisks literally, adds the @ symbol, and then appends the domain from m.group(2). This f-string becomes the replacement text for that particular match. For this fixed three-asterisk mask, the replacement string r'\1***@\2' would produce the same result; the callback earns its place once the redaction needs to depend on the match, because the function body can run any Python code before returning the replacement.
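
A fixed three-asterisk mask hides how long the username was. If the mask should instead reflect the hidden length, the replacement has to be computed per match, which is where a callback becomes necessary. The sketch below is a hypothetical variant (redact_emails_by_length is an assumed name) that also captures the rest of the username so its length can be measured.

```python
import re

def redact_emails_by_length(text):
    # Same character classes as the lesson's pattern, but the rest of the
    # username is captured too (group 2) so its length can be measured.
    pattern = r'([\w\.-])([\w\.-]*)@([\w\.-]+\.\w+)'

    def repl(m):
        hidden = '*' * len(m.group(2))  # one asterisk per hidden character
        return f"{m.group(1)}{hidden}@{m.group(3)}"

    return re.sub(pattern, repl, text)

masked = redact_emails_by_length("Write to alice.smith@example.com soon.")
print(masked)  # Write to a**********@example.com soon.
```

Revealing the length leaks a little more information than a fixed mask, so which variant is preferable depends on the privacy requirements.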

Testing Email Redaction

Let's test the redaction function with several email formats to ensure it handles different username styles correctly:

    emails_text = "Contact alice.smith@example.com and Bob-B@example.co.uk today."
    print(redact_emails(emails_text))

The test string contains two different email formats: alice.smith@example.com with a dot in the username and a standard domain, and Bob-B@example.co.uk with a hyphen in the username and a country-specific domain. Both should be redacted while preserving the first character and full domain.

    Contact a***@example.com and B***@example.co.uk today.

Excellent! Both emails were successfully redacted. alice.smith@example.com became a***@example.com, preserving the lowercase 'a' and the full domain. Bob-B@example.co.uk became B***@example.co.uk, preserving the uppercase 'B' and the multi-part domain. The callback function handled each match individually, extracting the first character (whether lowercase or uppercase, letter or allowed special character) and building the appropriate redacted version. The surrounding text remained unchanged, and the domains stayed fully visible for context.

Putting It All Together
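
As one possible way to combine the two techniques from this lesson, the sketch below chains phone normalization and email redaction over a single line of text. The helper clean_contact_line and the sample sentence are assumptions made for illustration; the two inner functions repeat the lesson's patterns, with the email callback written as a lambda for brevity.

```python
import re

def normalize_phone_numbers(text):
    pattern = r'\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})'
    return re.sub(pattern, r'+1-\1-\2-\3', text)

def redact_emails(text):
    pattern = r'([\w\.-])[\w\.-]*@([\w\.-]+\.\w+)'
    return re.sub(pattern, lambda m: f"{m.group(1)}***@{m.group(2)}", text)

# Hypothetical helper that applies both transformations in sequence.
def clean_contact_line(text):
    return redact_emails(normalize_phone_numbers(text))

line = "Reach Bob-B@example.co.uk at (415) 555-2671."
cleaned = clean_contact_line(line)
print(cleaned)  # Reach B***@example.co.uk at +1-415-555-2671.
```

The order of the two steps can matter: an email address that contains a ten-digit run would be altered by the phone pattern first, so real data may call for redacting emails before normalizing numbers.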

Conclusion and Next Steps

Congratulations on completing the final lesson of Extracting Data with Capture Groups in Python! You've built a comprehensive skill set for working with regular expressions: from basic pattern matching through sophisticated capture groups, backreferences, and now text transformation with re.sub(). In this lesson, you learned to use numbered backreferences like \1 and \2 to rearrange captured data in replacement strings, giving you the power to normalize and reformat text. You also discovered how callback functions extend re.sub() beyond simple replacements, letting you apply arbitrary Python logic to each match. These techniques turn regular expressions from a search tool into a complete data transformation system.

This marks a significant milestone: you've reached the end of this course! You started by learning to capture structured data with named groups, then used backreferences to enforce consistency in patterns, extracted practical information like emails and prices from real-world text, and finally mastered text transformation to standardize and modify captured data. Each lesson built on the previous one, expanding your regex capabilities and confidence. The patterns you've written and the problems you've solved give you skills that apply across countless programming challenges.

The journey doesn't stop here, though. Regular expressions have even more advanced features waiting for you in the next course: Regex Validation, Flags, and Text Processing in Python. You'll learn to validate entire inputs with strict rules, control matching behavior with powerful flags, use lookahead assertions for sophisticated conditional matching, and process large documents efficiently with iterators. These advanced techniques will make you truly proficient at handling complex text processing tasks in any Python project.

But first, let's solidify what you've learned today. The upcoming practice section will challenge you to reformat dates using backreferences, generate descriptive labels from compact codes, and anonymize usernames with callback functions. These exercises mirror the real-world scenarios where re.sub() proves invaluable. Get ready to transform some text!