Conditional Matching with Lookaheads

Introduction

Welcome back to Regex Validation, Flags, and Text Processing in Python! You've now completed two lessons, and your regex skills are growing impressively. In the first lesson, you learned to validate complete inputs with re.fullmatch(), creating username and password validators that check every character from start to finish. In the previous lesson, you mastered regex flags to control pattern behavior: writing readable, verbose patterns with comments, performing case-insensitive searches, handling line boundaries with re.MULTILINE, and matching across newlines with re.DOTALL . These tools give you tremendous flexibility in how patterns process text. However, there's another challenge that neither validation rules nor flags directly solve: what if you need to match text based on what comes before or after it, but without including that context in the match itself? Imagine extracting dollar amounts, but only when they're followed by the word "USD"; or finding passwords in a configuration file, but excluding any marked as "REDACTED"; or validating that a password contains both uppercase and lowercase letters without caring about their order. These scenarios require checking conditions without consuming the characters being checked. This lesson introduces lookaheads : powerful assertions that let you peek ahead in the text to verify conditions without including those characters in your match. Let's explore how lookaheads enable sophisticated conditional matching.

Understanding Lookaheads and Zero-Width Assertions

Lookaheads are a type of zero-width assertion: they check whether a pattern exists at a specific position without actually consuming any characters. Think of them as looking ahead in the text to verify something is there, then stepping back to continue matching from where you were. This "look but don't consume" behavior is fundamentally different from normal pattern matching, where the regex engine advances through the string as it matches each component. There are two types of lookaheads in Python's regex engine: Positive lookaheads (?=...) succeed if the pattern inside matches at the current position, but they don't advance the matching position. Negative lookaheads (?!...) succeed if the pattern inside does not match at the current position, again without consuming characters. Consider extracting prices from text where some are in USD and others in EUR. A normal pattern like r'\d+\.\d{2} USD' would match "12.50 USD," but it includes " USD" in the captured result. With a positive lookahead r'\d+\.\d{2}(?= USD)', you match the price only if it's followed by " USD," but the lookahead doesn't consume " USD," so your match contains just the numeric amount. Similarly, negative lookaheads let you exclude matches based on what follows. These assertions become especially powerful for validation, where you need to ensure certain characters exist somewhere in the string without caring about their exact position.

Positive Lookaheads for Context-Aware Matching

Let's explore positive lookaheads through a practical example: extracting prices from text, but only when they're followed by "USD." We want the numeric amounts themselves, not the currency label. A positive lookahead lets us specify this condition precisely. Pythonimport re def extract_prices_with_currency(text): # Match amounts only when followed by USD (not consumed) pattern = r'\d+(?:\.\d{2})?(?=\s*USD)' return re.findall(pattern, text)import re def extract_prices_with_currency(text): # Match amounts only when followed by USD (not consumed) pattern = r'\d+(?:\.\d{2})?(?=\s*USD)' return re.findall(pattern, text) This pattern breaks down into three key components: \d+ matches one or more digits for the dollar amount. (?:\.\d{2})? optionally matches a decimal point followed by exactly two digits; the (?:...) creates a non-capturing group so we can apply ? to the entire decimal portion. (?=\s*USD) is our positive lookahead, checking that optional whitespace and "USD" follow the number, but without including them in the match. The lookahead (?=\s*USD) ensures we only capture prices that have "USD" after them. The pattern \s* allows for zero or more spaces between the amount and "USD," handling both "12.50USD" and "12.50 USD." Crucially, the lookahead doesn't consume "USD" or the whitespace, so if multiple prices appear consecutively, we can match them all without the currency label interfering with subsequent matches.

Testing Price Extraction with Lookaheads

Let's see how our lookahead-based price extraction works on text containing various prices: Python prices = "Items: 12.50 USD, 9 USD, 100EUR, 7.50 usd, 20USD, sum 41.50 USD." print(extract_prices_with_currency(prices)) prices = "Items: 12.50 USD, 9 USD, 100EUR, 7.50 usd, 20USD, sum 41.50 USD." print(extract_prices_with_currency(prices)) This text contains several prices: some with spaces before "USD," one in EUR that we should skip, one in lowercase "usd" that we should also skip, and prices with and without decimal points. text ['12.50', '9', '20', '41.50'] ['12.50', '9', '20', '41.50'] Perfect! The output shows exactly the four prices that are followed by uppercase "USD": "12.50 USD," "9 USD," "20USD," and "41.50 USD." Notice what was excluded: "100EUR" failed because it's followed by "EUR," not "USD;" "7.50 usd" failed because our pattern specifies uppercase "USD," and regex is case-sensitive by default. The amounts themselves don't include "USD" or any whitespace, precisely because the lookahead checked for these characters without consuming them. Each price appears as a clean numeric string, ready for conversion to a float or further processing.

Negative Lookaheads to Exclude Patterns

While positive lookaheads check that something is present, negative lookaheads verify that something is not present. This becomes invaluable when you need to exclude certain matches based on context. Let's extract passwords from a configuration file, but skip any explicitly marked as "REDACTED." Pythondef find_passwords_in_config(text): # Match password values not followed by REDACTED pattern = r'password\s*=\s*(?!REDACTED\b)(\S+)' return re.findall(pattern, text)def find_passwords_in_config(text): # Match password values not followed by REDACTED pattern = r'password\s*=\s*(?!REDACTED\b)(\S+)' return re.findall(pattern, text) This pattern demonstrates negative lookahead in action: password\s*=\s* matches "password" followed by optional whitespace, an equals sign, and more optional whitespace. (?!REDACTED\b) is our negative lookahead, ensuring the next text is not "REDACTED" followed by a word boundary; if "REDACTED" appears at this position, the entire match fails. (\S+) captures one or more non-whitespace characters as the actual password value. The word boundary \b after "REDACTED" is critical here: without it, the pattern would reject passwords that merely start with "REDACTED," like "REDACTED_KEY." The boundary ensures we only exclude the exact word "REDACTED," not passwords that contain it as a prefix. The negative lookahead effectively says, "continue matching only if you don't see REDACTED as a complete word here."

Testing Password Extraction

Let's verify that our negative lookahead correctly filters out redacted passwords: Pythonconfig = "user=alice\npassword=secret123\npassword=REDACTED\npassword = token_abc\n" print(find_passwords_in_config(config))config = "user=alice\npassword=secret123\npassword=REDACTED\npassword = token_abc\n" print(find_passwords_in_config(config)) This configuration string contains three password lines: one with "secret123," one marked as "REDACTED," and one with "token_abc" that has spaces around the equals sign. text['secret123', 'token_abc']['secret123', 'token_abc'] Excellent! The output contains exactly the two non-redacted passwords: "secret123" and "token_abc." The line password=REDACTED was correctly excluded because the negative lookahead (?!REDACTED\b) detected "REDACTED" at that position and caused the match to fail. Notice that the pattern handled the varying whitespace around the equals signs correctly, matching both password=secret123 and password = token_abc. This demonstrates how negative lookaheads elegantly filter matches based on conditions without needing complex alternatives or multiple patterns.

Combining Multiple Lookaheads for Validation

Lookaheads become even more powerful when combined. Each lookahead can check a different condition, and all must succeed for the overall pattern to match. This technique is particularly useful for validation, where you need to ensure multiple requirements are met without caring about the order in which they appear. Let's validate passwords that must contain at least one lowercase letter, at least one uppercase letter, and at least one digit, with a minimum length of eight characters. Pythondef validate_strong_password(p): # Use lookaheads to require at least one lowercase, uppercase, and digit; min length 8 # Each (?=...) assertion checks the entire string without consuming characters return bool(re.fullmatch(r'(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}', p))def validate_strong_password(p): # Use lookaheads to require at least one lowercase, uppercase, and digit; min length 8 # Each (?=...) assertion checks the entire string without consuming characters return bool(re.fullmatch(r'(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}', p)) This pattern employs three positive lookaheads in sequence: (?=.*[a-z]) checks that somewhere in the string there's at least one lowercase letter; the .* matches any characters leading up to it. (?=.*[A-Z]) similarly checks for at least one uppercase letter anywhere in the string. (?=.*\d) ensures at least one digit appears somewhere. .{8,} finally matches the actual password, requiring at least eight characters of any type. The beauty of this approach is that each lookahead starts from the same position at the beginning of the string and scans forward independently. They don't interfere with each other or consume characters, so all three can verify their conditions regardless of where those characters appear in the password. After all lookaheads succeed, the .{8,} does the actual matching to ensure sufficient length. This pattern would reject "Password" (no digit), "password1" (no uppercase), "PASSWORD1" (no lowercase), and "Pass1" (too short), while accepting "Password1" or "StrongP4ss."

Testing Strong Password Validation

Let's test our multi-lookahead validator with several passwords to see which pass and fail: Pythonprint(validate_strong_password("StrongP4ss")) print(validate_strong_password("NoDigitsHere")) print(validate_strong_password("short7A"))print(validate_strong_password("StrongP4ss")) print(validate_strong_password("NoDigitsHere")) print(validate_strong_password("short7A")) These test cases check: a password meeting all requirements, one missing digits, and one that's too short despite having the right character types. textTrue False FalseTrue False False The results clearly demonstrate how each lookahead enforces its condition. "StrongP4ss" passes because it contains lowercase letters ("trong," "ss"), uppercase letters ("S," "P"), a digit ("4"), and is eight characters long. "NoDigitsHere" fails because, while it has uppercase, lowercase, and sufficient length, the (?=.*\d) lookahead finds no digits anywhere in the string. "short7A" fails despite having all character types because it's only seven characters, not meeting the .{8,} minimum length requirement. This validation approach is concise, efficient, and easy to extend: adding a requirement for special characters would simply mean adding another lookahead like (?=.*[!@#$%]).

Conclusion and Next Steps

Congratulations on mastering conditional matching with lookaheads! You've learned a sophisticated regex technique that dramatically expands your pattern-matching capabilities. You explored positive lookaheads to extract prices only when followed by specific currency codes, capturing clean numeric values without including the context. You discovered negative lookaheads to exclude passwords marked as "REDACTED," filtering out unwanted matches based on what follows them. Most impressively, you combined multiple lookaheads to validate strong passwords, checking for lowercase letters, uppercase letters, and digits simultaneously without caring about their order or position. Lookaheads give you precise control over conditional matching in ways that simple patterns cannot achieve. They're essential for extracting context-dependent data, excluding matches based on surrounding text, and validating complex requirements where multiple conditions must coexist. The validators you built in the first lesson become far more powerful when enhanced with lookaheads: imagine username rules that forbid certain prefixes using negative lookaheads, or email validation that checks for specific domain patterns without consuming them. The practice exercises ahead will challenge you to modify currency extraction patterns, debug negative lookaheads with word boundary issues, strengthen password validation by adding new character requirements, create username validators with forbidden prefixes, and extract priority keywords from sentences based on punctuation context. Get ready to apply these lookahead techniques creatively and build your own conditional matching patterns!

Previous Lesson

Next Lesson: Efficient Processing with Iterators

Join the 1M+ learners on CodeSignal

Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal