Introduction

Welcome to the third lesson of Regex Foundations: Matching Patterns! You've already made substantial progress: you have learned how to match literals and special characters, and then mastered quantifiers to control repetition. Now we're ready to tackle one of the most versatile and practical features of regular expressions: character classes.

So far, we've used metacharacters like the dot (.) to match any single character, and shorthand notations like \d for digits. While these tools are powerful, they're sometimes too broad or too narrow for specific tasks. What if we need to match only vowels? Or only hexadecimal digits? Or any character except quotes? Character classes give us precise control over which characters we want to match at each position in our pattern.

In this lesson, we'll explore how to define custom character sets using square brackets, use ranges to write concise patterns, and leverage shorthand classes for common character types. We'll also discover negated character classes, which match everything except specified characters. By combining these techniques with the quantifiers we learned previously, we'll build patterns that can extract hex color codes, tokenize text, and parse structured data with elegance and efficiency.

Why Character Classes Matter

When working with real-world text, we often encounter situations where we need more precision than "any character" but more flexibility than a specific literal. Consider validating a hexadecimal color code like #FF5733. We need to match characters that are digits (0-9) or letters (A-F or a-f), but nothing else. The dot operator is too permissive (it would accept #ZZ5733), and listing every valid character individually would be cumbersome.

Character classes solve this problem by letting us define sets of acceptable characters. Instead of writing separate alternatives for each possibility, we specify a group of characters within square brackets. The regex engine then matches any single character from that set. This approach combines precision with conciseness, allowing us to express complex matching requirements in compact patterns.

Beyond custom sets, we frequently need to match standard categories like digits, word characters, or whitespace. Rather than defining these common sets repeatedly, regex provides shorthand character classes: compact notations that represent frequently used character groups. These shorthands not only save typing but also make patterns more readable and maintainable.

Defining Custom Character Sets

The fundamental syntax for character classes uses square brackets: [...]. Any character placed inside the brackets becomes part of the set, and the regex engine matches exactly one character that appears in that set. For example, [aeiou] matches any single lowercase vowel, while [135] matches the digit 1, 3, or 5.

The pattern /[aeiou]/g creates a character class containing five specific characters. The g flag makes the regex global, allowing it to find all matches in the string rather than stopping after the first one. When the regex engine scans through our text, it checks each character against this set. If the character is a, e, i, o, or u, it's a match; otherwise, the engine moves to the next position. This selective matching allows us to extract exactly the characters we care about while ignoring everything else.
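As a minimal sketch of this pattern in use, the snippet below runs it against a sample sentence. The sentence and the variable names are assumptions chosen for illustration.

```javascript
// Sample sentence (an assumption for illustration).
const sentence = "The quick brown fox jumps";

// [aeiou] matches exactly one lowercase vowel; the g flag collects every match.
// match() returns null when nothing matches, so fall back to an empty array.
const vowels = sentence.match(/[aeiou]/g) || [];

console.log(vowels);        // ["e", "u", "i", "o", "o", "u"]
console.log(vowels.length); // 6
```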

The pattern found six vowels in our text: the e from "The", the u and i from "quick", two o's from "brown" and "fox", and the u from "jumps". Every consonant, space, and punctuation mark was ignored because those characters don't appear in our set.

Using Ranges for Efficiency

Writing [0123456789] to match digits or [abcdefghijklmnopqrstuvwxyz] to match lowercase letters would be tedious and error-prone. Fortunately, character classes support ranges using the hyphen character. A range like [a-z] matches any lowercase letter from a to z, while [0-9] matches any digit.

The pattern /[0-9]/g uses a range to represent all ten digits compactly. The hyphen between 0 and 9 tells the regex engine: "match any character whose position falls between these two endpoints in the character encoding." This works because digits are sequentially encoded (0 has code 48, 1 has code 49, etc.), so [0-9] expands to match all characters in that sequence.
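The following sketch shows the range at work. The sample label is an assumption, and the charCodeAt calls are included only to confirm the sequential encoding described above.

```javascript
// Sample label (an assumption for illustration).
const label = "Room 5, Floor 12, Building B";

// [0-9] matches one digit at a time, so "12" yields two separate matches.
const digits = label.match(/[0-9]/g) || [];
console.log(digits); // ["5", "1", "2"]

// The range endpoints are consecutive code points: 48 through 57.
const firstCode = "0".charCodeAt(0);
const lastCode = "9".charCodeAt(0);
console.log(firstCode, lastCode); // 48 57
```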

The pattern matched three individual digits: 5 from "Room 5", and 1 and 2 from "Floor 12". Notice that the digits from "12" are returned as separate matches (1 and 2) because the character class matches one character at a time. The letter in "Building B" was correctly excluded since it falls outside the range.

Combining Ranges and Case Sensitivity

Character classes become even more powerful when we combine multiple ranges or individual characters within a single set. We can specify [a-zA-Z] to match both lowercase and uppercase letters, or [0-9a-fA-F] to match hexadecimal digits. Each range or character adds to the set of acceptable matches.

The pattern /[a-zA-Z]+/g combines two ranges within a single character class: a-z for lowercase letters and A-Z for uppercase letters. The + quantifier (which we learned in the previous lesson) extends this to match one or more consecutive letters. This combination lets us extract letter sequences regardless of their case, while excluding digits and other characters.
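Here is a short sketch of the combined ranges, with and without the quantifier. The input string is an assumption for illustration.

```javascript
// Sample input (an assumption for illustration).
const mixedInput = "HelloWorld123";

// With +, consecutive letters of either case form a single match.
const letterRuns = mixedInput.match(/[a-zA-Z]+/g) || [];
console.log(letterRuns); // ["HelloWorld"]

// Without +, every letter is returned as its own match.
const singleLetters = mixedInput.match(/[a-zA-Z]/g) || [];
console.log(singleLetters.length); // 10
```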

The pattern captured HelloWorld as a single match because all those characters are consecutive letters (both uppercase and lowercase). The digits 123 at the end were excluded since they fall outside both the a-z and A-Z ranges. If we had omitted the + quantifier, each individual letter would have been returned as a separate match instead.

Practical Application: Extracting Hex Codes

Let's apply what we've learned to a realistic task: extracting two-digit hexadecimal values from text. Hexadecimal notation uses digits 0-9 and letters A-F (case-insensitive) to represent values. Color codes, memory addresses, and byte data often appear in this format, making hex extraction a common text-processing challenge.

The pattern /[0-9a-fA-F]{2}/g elegantly captures hex digits by combining three ranges: 0-9 for digits, a-f for lowercase hex letters, and A-F for uppercase hex letters. The {2} quantifier ensures we match exactly two consecutive hex characters. This pattern successfully handles mixed case and finds hex values in various contexts (color codes, byte sequences, standalone values). Note that we return an empty array if match() returns null (when no matches are found), which provides consistent behavior.
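A function along these lines might look like the sketch below. The name extractHexPairs and the sample string are hypothetical and stand in for the lesson's own example.

```javascript
// Hypothetical helper: returns every two-character hex pair found in the text.
function extractHexPairs(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(/[0-9a-fA-F]{2}/g) || [];
}

// Sample input (an assumption for illustration).
const hexSample = "#FF5733 1a 2B c3 7f";

console.log(extractHexPairs(hexSample)); // ["FF", "57", "33", "1a", "2B", "c3", "7f"]
console.log(extractHexPairs("xyz"));     // []
```

One caveat with this sketch: the pattern has no notion of word boundaries, so it also matches hex-looking pairs inside ordinary words. For example, extractHexPairs("decade") returns ["de", "ca", "de"].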

Our function extracted seven two-digit hex values from the sample text: FF, 57, and 33 from the color code #FF5733, three standalone byte pairs, and one single value. The pattern handles both uppercase and lowercase letters seamlessly because the character class includes both the a-f and A-F ranges.

Shorthand Character Classes

While custom character classes offer flexibility, certain character sets appear so frequently that regex provides convenient shorthands. These special sequences begin with a backslash and represent commonly needed categories. The three most essential shorthands are \d for digits, \w for word characters, and \s for whitespace.

In JavaScript, the \d shorthand is equivalent to [0-9], matching any single digit. The \w shorthand matches word characters, which include ASCII letters, digits, and the underscore; it's equivalent to [A-Za-z0-9_]. The \s shorthand matches whitespace characters, including spaces, tabs, and newlines. These shorthands save typing and make patterns more readable: compare \d+ with [0-9]+ to see how the shorter form states the intent more directly.
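The three shorthands can be tried side by side, as in this sketch. The sample string and variable names are assumptions for illustration.

```javascript
// Sample text containing digits, an underscore, spaces, and a tab (an assumption).
const note = "Order 42 shipped\tto Bay_7";

// \d+ matches runs of digits.
const numbers = note.match(/\d+/g) || [];
console.log(numbers); // ["42", "7"]

// \w+ matches runs of letters, digits, and underscores.
const words = note.match(/\w+/g) || [];
console.log(words); // ["Order", "42", "shipped", "to", "Bay_7"]

// \s matches one whitespace character at a time: three spaces and one tab here.
const gaps = note.match(/\s/g) || [];
console.log(gaps.length); // 4
```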

Comparing Shorthands to Manual Classes

Let's examine how the \w shorthand compares to its manual equivalent by tokenizing text. Tokenization (splitting text into meaningful units like words) is a fundamental text-processing task, and \w+ provides a compact way to extract word-like tokens.

Both patterns use the + quantifier to match one or more word characters, capturing complete tokens rather than individual characters. In JavaScript, \w is exactly equivalent to [A-Za-z0-9_]—both match only ASCII letters, digits, and underscores. The shorthand \w+ is simply more concise and conventional, making your patterns easier to read and write.
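The comparison can be sketched as follows. The sample text is an assumption, and the é is written as the escape \u00e9 so the input is unambiguous.

```javascript
// Sample text (an assumption); \u00e9 is the precomposed character é.
const mixed = "Caf\u00e942 user_name hello";

// Tokenize with the shorthand and with its manual equivalent.
const viaShorthand = mixed.match(/\w+/g) || [];
const viaManual = mixed.match(/[A-Za-z0-9_]+/g) || [];

console.log(viaShorthand); // ["Caf", "42", "user_name", "hello"]
console.log(viaManual);    // ["Caf", "42", "user_name", "hello"]
```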

The outputs are identical because both patterns match the same set of characters. Notice how both split Café42 into Caf and 42 because the accented character é falls outside the ASCII range that \w and [A-Za-z0-9_] cover. This demonstrates an important characteristic of JavaScript regex: the \w shorthand is limited to ASCII characters. When working with international text containing accented letters or characters from non-Latin alphabets, you may need to define custom character classes or use alternative approaches to capture those characters.
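One possible alternative approach, not covered in this lesson, is a character class built from Unicode property escapes. The sketch below assumes a JavaScript runtime that supports \p{...} together with the u flag (ES2018 or later), and the sample text is hypothetical.

```javascript
// Sample text with accented letters (an assumption); é and ï are written as escapes.
const cafe = "Caf\u00e942 na\u00efve";

// \p{L} matches any Unicode letter and \p{N} any Unicode number.
// Property escapes require the u flag.
const unicodeTokens = cafe.match(/[\p{L}\p{N}_]+/gu) || [];

console.log(unicodeTokens); // ["Café42", "naïve"]
```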

Negated Character Classes

Sometimes it's easier to specify what we don't want to match rather than what we do. Negated character classes use a caret (^) immediately after the opening bracket to invert the set. The pattern [^abc] matches any single character except a, b, or c. This technique proves invaluable when extracting content between delimiters or excluding specific characters from matches.

The pattern /[^0-9]+/g creates a negated character class that matches any character except digits (0-9). The + quantifier extends this to match one or more consecutive non-digit characters. This allows us to extract the runs of non-digit characters from mixed alphanumeric text while skipping over the numbers entirely. Keep in mind that a negated class matches anything not listed, so spaces and punctuation would be included in those runs as well.
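A small sketch of the negated class follows. Both input strings are assumptions for illustration.

```javascript
// Sample input (an assumption for illustration).
const alnum = "abc123def456";

// [^0-9]+ matches runs of anything that is not a digit.
const nonDigitRuns = alnum.match(/[^0-9]+/g) || [];
console.log(nonDigitRuns); // ["abc", "def"]

// A negated class also accepts spaces, since a space is not a digit.
const withSpace = "a1 b2".match(/[^0-9]+/g) || [];
console.log(withSpace); // ["a", " b"]
```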

The pattern captured abc and def — the two runs of non-digit characters — while the digit sequences 123 and 456 were excluded. This demonstrates how negated classes let us define matches by exclusion, which is often simpler than listing every character we want to include.

Matching Content Between Delimiters

A powerful application of negated character classes is extracting content enclosed within delimiters, such as text inside quotation marks. Rather than trying to match everything until we hit a closing delimiter (which can be complex), we use a negated class to match "any character except the delimiter."

The pattern /"[^"]+"/g has three parts: it starts with a literal quote ", then uses [^"]+ to match one or more characters that aren't quotes, and ends with another literal quote ". This approach ensures we capture everything between the delimiters without accidentally matching beyond the closing delimiter. The negated class [^"] is the key: it matches any character (letters, digits, spaces, punctuation) except the quote character itself. Because + requires at least one character, an empty pair of quotes ("") is not matched.
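A function built on this pattern might look like the sketch below. The name extractQuoted and the sample sentence are hypothetical.

```javascript
// Hypothetical helper: returns each double-quoted segment, quotes included.
function extractQuoted(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(/"[^"]+"/g) || [];
}

// Sample sentence (an assumption for illustration).
const spoken = 'She said "hello world" and then "goodbye".';

console.log(extractQuoted(spoken)); // ['"hello world"', '"goodbye"']
console.log(extractQuoted('""'));   // [] because + needs at least one character
```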

Our function successfully extracted both quoted strings, including the surrounding quotes. The pattern matched "hello world" and "goodbye" by finding opening quotes, capturing all non-quote characters, and stopping at the closing quotes. This technique generalizes to other delimiters: we could extract content between parentheses using /\([^)]+\)/g or between square brackets using /\[[^\]]+\]/g (note that the literal parentheses and brackets are escaped, since they have special meaning in regex).
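The two delimiter variants can be tried in the same way. This is a sketch only, and the input string is an assumption for illustration.

```javascript
// Sample input containing parentheses and square brackets (an assumption).
const snippet = "call(foo) and list[0] then (bar baz)";

// Literal parentheses are escaped; [^)] matches anything except a closing parenthesis.
const inParens = snippet.match(/\([^)]+\)/g) || [];
console.log(inParens); // ["(foo)", "(bar baz)"]

// Literal brackets are escaped, including the ] inside the negated class.
const inBrackets = snippet.match(/\[[^\]]+\]/g) || [];
console.log(inBrackets); // ["[0]"]
```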

Conclusion and Next Steps

In this lesson, we've significantly expanded our pattern-matching precision by mastering character classes. We learned to define custom character sets using square brackets, which let us specify exactly which characters to match at any position. We discovered how ranges like [a-z] and [0-9] provide compact notation for sequential characters, and how combining ranges creates powerful patterns like [0-9a-fA-F] for hexadecimal matching.

We explored shorthand character classes (\d, \w, \s) that represent common character categories with concise notation. These shorthands save typing and make patterns more conventional and readable. We learned that in JavaScript, \w matches ASCII word characters [A-Za-z0-9_], which is important to remember when working with international text. We also learned about negated character classes using [^...], which match any character except those listed, enabling elegant solutions for delimiter-based extraction and filtering.

By combining character classes with the quantifiers from the previous lesson, we can now build sophisticated patterns that extract hex codes, tokenize text, parse quoted strings, and handle countless other real-world text-processing challenges. You've now mastered three fundamental pillars of regex: special characters, quantifiers, and character classes. These tools work together to create flexible, precise patterns that adapt to varying text formats. The practice exercises await to test your skills with vowel hunting, year extraction, hashtag parsing, and bracketed content analysis. Let's put these character classes to work and watch your regex expertise flourish!
