Character Classes and Shorthands

Introduction

Welcome to the third lesson of Regex Foundations: Matching Patterns! You've already made substantial progress: you have learned how to match literals and special characters, then mastered quantifiers to control repetition. Now we're ready to tackle one of the most versatile and practical features of regular expressions: character classes.

So far, we've used metacharacters like the dot (.) to match any single character, and shorthand notations like \d for digits. While these tools are powerful, they're sometimes too broad or too narrow for specific tasks. What if we need to match only vowels? Or only hexadecimal digits? Or any character except quotes? Character classes give us precise control over which characters we want to match at each position in our pattern.

In this lesson, we'll explore how to define custom character sets using square brackets, use ranges to write concise patterns, and leverage shorthand classes for common character types. We'll also discover negated character classes, which match everything except specified characters. By combining these techniques with the quantifiers we learned previously, we'll build patterns that can extract hex color codes, tokenize text, and parse structured data.

Why Character Classes Matter

When working with real-world text, we often encounter situations where we need more precision than "any character" but more flexibility than a specific literal. Consider validating a hexadecimal color code like #FF5733. We need to match characters that are digits (0-9) or letters (A-F or a-f), but nothing else. The dot operator is too permissive (it would accept #ZZ5733), and listing every valid character individually would be cumbersome.

Character classes solve this problem by letting us define sets of acceptable characters. Instead of writing separate alternatives for each possibility, we specify a group of characters within square brackets. The regex engine then matches any single character from that set. This approach combines precision with conciseness, allowing us to express complex matching requirements in compact patterns.

Beyond custom sets, we frequently need to match standard categories like digits, word characters, or whitespace. Rather than defining these common sets repeatedly, regex provides shorthand character classes: compact notations that represent frequently used character groups. These shorthands not only save typing but also make patterns more readable and maintainable.
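To make the contrast concrete, here is a minimal sketch that checks the two color codes mentioned above against a dot-based pattern and a character-class pattern. The names `loose` and `strict` are illustrative only, and `re.fullmatch` is used so that the entire string has to fit the pattern.

```python
import re

# Hypothetical pattern names, chosen for this illustration only
loose = r'#......'            # '#' followed by any six characters
strict = r'#[0-9A-Fa-f]{6}'   # '#' followed by exactly six hex digits

for code in ["#FF5733", "#ZZ5733"]:
    loose_ok = re.fullmatch(loose, code) is not None
    strict_ok = re.fullmatch(strict, code) is not None
    print(code, loose_ok, strict_ok)
# #FF5733 True True
# #ZZ5733 True False
```

The dot-based pattern accepts both strings, while the character class rejects the one containing Z.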

Defining Custom Character Sets

The fundamental syntax for character classes uses square brackets: [...]. Any character placed inside the brackets becomes part of the set, and the regex engine matches exactly one character that appears in that set. For example, [aeiou] matches any single lowercase vowel, while [135] matches the digit 1, 3, or 5.

```python
import re

# Match lowercase vowels only
vowel_pattern = r'[aeiou]'
text = "The quick brown fox jumps."
vowels = re.findall(vowel_pattern, text)
print(vowels)  # ['e', 'u', 'i', 'o', 'o', 'u']
```

The pattern r'[aeiou]' creates a character class containing five specific characters. When the regex engine scans through our text, it checks each character against this set. If the character is a, e, i, o, or u, it's a match; otherwise, the engine moves to the next position. This selective matching allows us to extract exactly the characters we care about while ignoring everything else.
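As a small additional sketch, the snippet below tries the [135] class from the paragraph above and shows a class occupying a single position inside a longer pattern. The sample strings are made up for illustration.

```python
import re

# [135] matches one character that is 1, 3, or 5
odd_hits = re.findall(r'[135]', "Order 1352, aisle 47")
print(odd_hits)  # ['1', '3', '5']

# A class fills exactly one position inside a longer pattern
spellings = re.findall(r'gr[ae]y', "gray, grey, and groy")
print(spellings)  # ['gray', 'grey']
```

In the second pattern, only the third character is flexible; the literals around it still have to match exactly.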

Using Ranges for Efficiency

Writing [0123456789] for matching digits or [abcdefghijklmnopqrstuvwxyz] for lowercase letters would be tedious and error-prone. Fortunately, character classes support ranges using the hyphen character. A range like [a-z] matches any lowercase letter from a to z, while [0-9] matches any digit.

```python
import re

# Match single digits using a range
digit_pattern = r'[0-9]'
numbers = "Room 5, Floor 12, Building B."
digits = re.findall(digit_pattern, numbers)
print(digits)  # ['5', '1', '2']
```

The pattern r'[0-9]' uses a range to represent all ten digits compactly. The hyphen between 0 and 9 tells the regex engine: "match any character whose position falls between these two endpoints in the character encoding." This works because digits are sequentially encoded (0 has code 48, 1 has code 49, etc.), so [0-9] expands to match all characters in that sequence. Because the pattern has no quantifier, each match is a single digit, which is why 12 comes back as '1' and '2'.
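Since ranges follow character codes, it can help to inspect those codes directly. The sketch below is illustrative rather than part of the lesson's exercises. It shows the digit code points, a common pitfall with the range [A-z], and how a hyphen placed last in a class is treated as a literal hyphen.

```python
import re

# Digits occupy consecutive code points, which is what makes [0-9] work
print(ord('0'), ord('9'))  # 48 57

# Pitfall: [A-z] spans code points 65..122, which also covers [ \ ] ^ _ `
print(re.findall(r'[A-z]', "a_Z[^"))     # ['a', '_', 'Z', '[', '^']
print(re.findall(r'[A-Za-z]', "a_Z[^"))  # ['a', 'Z']

# A hyphen placed first or last in the class is a literal hyphen, not a range
print(re.findall(r'[0-9-]+', "call 555-0199"))  # ['555-0199']
```

Writing the two letter ranges separately, as in [A-Za-z], avoids picking up the punctuation that sits between Z and a.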

Combining Ranges and Case Sensitivity

Character classes become even more powerful when we combine multiple ranges or individual characters within a single set. We can specify [a-zA-Z] to match both lowercase and uppercase letters, or [0-9a-fA-F] to match hexadecimal digits. Each range or character adds to the set of acceptable matches.

```python
import re

# Match both uppercase and lowercase letters
letter_pattern = r'[a-zA-Z]+'
mixed_text = "HelloWorld123"
words = re.findall(letter_pattern, mixed_text)
print(words)  # ['HelloWorld']
```

The pattern r'[a-zA-Z]+' combines two ranges within a single character class: a-z for lowercase letters and A-Z for uppercase letters. The + quantifier (which we learned in the previous lesson) extends this to match one or more consecutive letters. This combination lets us extract letter sequences regardless of their case, while excluding digits and other characters.
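Ranges and individual characters can also be mixed, and different classes can sit at different positions. As a hedged sketch, the hypothetical `identifier_pattern` below approximates ASCII-only variable names: the first class allows a letter or underscore, and the second also allows digits.

```python
import re

# First character: a letter or underscore; the rest: letters, digits, underscores
identifier_pattern = r'[a-zA-Z_][a-zA-Z0-9_]*'

code_line = "total_1 = _tmp + count2"
identifiers = re.findall(identifier_pattern, code_line)
print(identifiers)  # ['total_1', '_tmp', 'count2']
```

This is a simplification for illustration; real language grammars usually allow more than ASCII in identifiers.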

Practical Application: Extracting Hex Codes

Let's apply what we've learned to a realistic task: extracting two-digit hexadecimal values from text. Hexadecimal notation uses digits 0-9 and letters A-F (case-insensitive) to represent values. Color codes, memory addresses, and byte data often appear in this format, making hex extraction a common text-processing challenge.

```python
import re

def extract_hex_codes(text):
    # Find all two-digit hexadecimal values (case-insensitive)
    # Combines character class [0-9a-fA-F] with quantifier {2}
    pattern = r'[0-9a-fA-F]{2}'
    return re.findall(pattern, text)

sample_hex = "Color codes: #FFAA00, rgb(12, 34, 56), bytes: 7f 2A 0c."
print(extract_hex_codes(sample_hex))
```

The pattern r'[0-9a-fA-F]{2}' captures hex digits by combining three ranges: 0-9 for digits, a-f for lowercase hex letters, and A-F for uppercase hex letters. The {2} quantifier ensures we match exactly two consecutive hex characters. This pattern handles mixed case and finds hex values in various contexts (color codes, RGB values, byte sequences).

```text
['de', 'FF', 'AA', '00', '12', '34', '56', '7f', '2A', '0c']
```

Our function returned ten two-character matches. It found FF, AA, and 00 from the color code #FFAA00, the decimal digits 12, 34, 56 from the RGB function (which happen to be valid hex), and the standalone pairs 7f, 2A, 0c at the end. The first match, de, is a false positive: it comes from the word "codes", because d and e are both valid hex digits and the pattern places no constraint on what surrounds the pair. The pattern handles both uppercase (FF, AA) and lowercase (7f, 0c) letters because the character class includes both cases.
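The two-digit pattern matches any adjacent pair of hex characters, including the de inside "codes". One way to narrow the result, sketched here with a hypothetical helper named `extract_hex_colors`, is to anchor the class to the leading # of a color code and require exactly six hex digits.

```python
import re

def extract_hex_colors(text):
    # Hypothetical helper: require a leading '#' and exactly six hex digits
    return re.findall(r'#[0-9a-fA-F]{6}', text)

sample_hex = "Color codes: #FFAA00, rgb(12, 34, 56), bytes: 7f 2A 0c."
colors = extract_hex_colors(sample_hex)
print(colors)  # ['#FFAA00']
```

This sketch would still accept the first six digits of a longer hex run after the #, so it is a starting point rather than a complete validator.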

Shorthand Character Classes

While custom character classes offer flexibility, certain character sets appear so frequently that regex provides convenient shorthands. These special sequences begin with a backslash and represent commonly needed categories. The three most essential shorthands are \d for digits, \w for word characters, and \s for whitespace.

The \d shorthand matches any single digit. Its ASCII equivalent is [0-9], but in Python 3, for string patterns, \d also matches decimal digits from other scripts (such as Arabic-Indic digits) unless the re.ASCII flag is set. The \w shorthand matches word characters, which include letters, digits, and the underscore. Its closest ASCII equivalent is [A-Za-z0-9_], but in Python 3, \w goes further by also matching Unicode letters from all languages (such as é, ñ, or ü), making it more inclusive than a manually defined ASCII class. The \s shorthand matches whitespace characters, including spaces, tabs, and newlines.

These shorthands save typing and make patterns more readable: compare \d+ to [0-9]+ to see how the shorter form states its intent more directly.
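A short sketch of the three shorthands applied to one made-up string follows. The last two lines illustrate that, for string patterns in Python 3, \d is not limited to ASCII digits unless the `re.ASCII` flag is passed; the variable names are assumptions for this example.

```python
import re

log_line = "Order 66 shipped\ton 2024-05-01"

print(re.findall(r'\d+', log_line))  # ['66', '2024', '05', '01']
print(re.findall(r'\w+', log_line))  # ['Order', '66', 'shipped', 'on', '2024', '05', '01']
print(re.findall(r'\s', log_line))   # [' ', ' ', '\t', ' ']

# "\u0663" is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit
mixed_digits = "7 \u0663"
print(len(re.findall(r'\d', mixed_digits)))                  # 2
print(len(re.findall(r'\d', mixed_digits, flags=re.ASCII)))  # 1
```

When only 0 through 9 should count as digits, either the explicit class [0-9] or the `re.ASCII` flag gives that behavior.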

Comparing Shorthands to Manual Classes

Let's examine how the \w shorthand compares to its manual ASCII equivalent by tokenizing text. Tokenization (splitting text into meaningful units like words) is a fundamental text-processing task, and \w+ provides a compact way to extract word-like tokens.

```python
import re

def find_words(text):
    # Tokenize using shorthand \w+ (one or more word characters)
    words_shorthand = re.findall(r'\w+', text)
    # Compare to manual character class
    words_manual = re.findall(r'[A-Za-z0-9_]+', text)
    return words_shorthand, words_manual

sample_text = "Tokens: hello_world, Café42, tabs\tand spaces!"
w1, w2 = find_words(sample_text)
print(w1)
print(w2)
```

Both patterns use the + quantifier to match one or more word characters, capturing complete tokens rather than individual characters. As we noted above, \w includes Unicode letters while [A-Za-z0-9_] restricts matches to ASCII characters only. Let's see how this difference plays out in practice.

```text
['Tokens', 'hello_world', 'Café42', 'tabs', 'and', 'spaces']
['Tokens', 'hello_world', 'Caf', '42', 'tabs', 'and', 'spaces']
```

The outputs confirm the distinction: \w+ captured Café42 as a single token because \w includes Unicode letters like é. However, the manual pattern [A-Za-z0-9_]+ split this into Caf and 42 because é falls outside the ASCII ranges specified by A-Z and a-z. This demonstrates why shorthands are often preferable: they handle international text naturally, making our patterns more robust across different languages and character sets.
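When ASCII-only behavior is actually wanted, the shorthand can be kept and restricted with the `re.ASCII` flag. The following sketch reuses the sample text from above and checks that \w+ under `re.ASCII` produces the same tokens as the manual class; the variable names are illustrative.

```python
import re

sample_text = "Tokens: hello_world, Café42, tabs\tand spaces!"

# re.ASCII restricts \w (and \d, \s) to their ASCII definitions
ascii_words = re.findall(r'\w+', sample_text, flags=re.ASCII)
manual_words = re.findall(r'[A-Za-z0-9_]+', sample_text)

print(ascii_words == manual_words)  # True
print(ascii_words)
# ['Tokens', 'hello_world', 'Caf', '42', 'tabs', 'and', 'spaces']
```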

Negated Character Classes

Sometimes it's easier to specify what we don't want to match rather than what we do. Negated character classes use a caret (^) immediately after the opening bracket to invert the set. The pattern [^abc] matches any single character except a, b, or c. This technique proves invaluable when extracting content between delimiters or excluding specific characters from matches.

```python
import re

# Match any character except digits
non_digit_pattern = r'[^0-9]+'
mixed = "abc123def456"
non_digits = re.findall(non_digit_pattern, mixed)
print(non_digits)  # ['abc', 'def']
```

The pattern r'[^0-9]+' creates a negated character class that matches any character except digits (0-9). The + quantifier extends this to match one or more consecutive non-digit characters. This allows us to extract all the letter sequences from mixed alphanumeric text while skipping over the numbers entirely.
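Two details of negated classes are easy to overlook, and the sketch below illustrates them with made-up inputs: a negation can combine literals and shorthands, and a negated class matches newlines, which the dot does not do by default.

```python
import re

# Everything that is neither a lowercase vowel nor whitespace
consonant_runs = re.findall(r'[^aeiou\s]+', "hello world")
print(consonant_runs)  # ['h', 'll', 'w', 'rld']

# Unlike the dot, a negated class matches newlines too
two_lines = "abc\ndef1"
print(re.findall(r'[^0-9]+', two_lines))  # ['abc\ndef']
print(re.findall(r'.+', two_lines))       # ['abc', 'def1']
```

Because [^0-9] means "anything that is not a digit", spaces, punctuation, and line breaks all qualify unless they are added to the excluded set.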

Matching Content Between Delimiters

A powerful application of negated character classes is extracting content enclosed within delimiters, such as text inside quotation marks. Rather than trying to match everything until we hit a closing delimiter (which can be complex), we use a negated class to match "any character except the delimiter."

```python
import re

def extract_quoted_content(text):
    # Match content inside quotes: non-quote chars using negated class
    # [^"]+ means one or more characters that are not quotes
    return re.findall(r'"[^"]+"', text)

quoted = 'She said "hello world" and "goodbye".'
print(extract_quoted_content(quoted))
```

The pattern r'"[^"]+"' starts with a literal quote ", then uses [^"]+ to match one or more characters that aren't quotes, and ends with another literal quote ". This approach ensures we capture everything between the delimiters without accidentally matching beyond the closing delimiter. The negated class [^"] is the key: it matches any character (letters, digits, spaces, punctuation) except the quote character itself.

```text
['"hello world"', '"goodbye"']
```

Our function extracted both quoted strings, including the surrounding quotes. The pattern matched "hello world" and "goodbye" by finding opening quotes, capturing all non-quote characters, and stopping at the closing quotes. This technique generalizes to other delimiters: we could extract content between parentheses using \([^)]+\) or between square brackets using \[[^\]]+\] (note the escaped brackets since they have special meaning in regex).
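The two delimiter patterns mentioned above can be tried directly. This is a minimal sketch on a made-up string; it also shows that, because of the + quantifier, empty pairs such as () and [] are not matched.

```python
import re

snippet = 'call(foo, bar) reads items[0] and then (x), but not () or [].'

# Content between parentheses: \( and \) are literal, [^)]+ stops before ')'
in_parens = re.findall(r'\([^)]+\)', snippet)
print(in_parens)  # ['(foo, bar)', '(x)']

# Content between square brackets: the ']' inside the class is escaped
in_brackets = re.findall(r'\[[^\]]+\]', snippet)
print(in_brackets)  # ['[0]']
```

As with the quoted-string example, the delimiters themselves are part of each match.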

Conclusion and Next Steps

In this lesson, we've significantly expanded our pattern-matching precision by mastering character classes. We learned to define custom character sets using square brackets, which let us specify exactly which characters to match at any position. We discovered how ranges like [a-z] and [0-9] provide compact notation for sequential characters, and how combining ranges creates powerful patterns like [0-9a-fA-F] for hexadecimal matching.

We explored shorthand character classes (\d, \w, \s) that represent common character categories with concise notation. These shorthands not only save typing but also handle Unicode characters appropriately, making our patterns more internationally robust. We also learned about negated character classes using [^...], which match any character except those listed, enabling elegant solutions for delimiter-based extraction and filtering.

By combining character classes with the quantifiers from the previous lesson, we can now build patterns that extract hex codes, tokenize text, parse quoted strings, and handle many other real-world text-processing challenges. You've now mastered three fundamental pillars of regex: special characters, quantifiers, and character classes. These tools work together to create flexible, precise patterns that adapt to varying text formats. The practice exercises that follow will test your skills with vowel hunting, year extraction, hashtag parsing, and bracketed content analysis. Let's put these character classes to work!
