Introduction to Tokenization and OOV Handling

Welcome to this lesson on tokenization and handling Out-of-Vocabulary (OOV) words. Tokenization is a fundamental step in Natural Language Processing (NLP) that involves breaking down text into smaller units called tokens. This process is crucial for AI and Large Language Models (LLMs) as it allows them to understand and process text data effectively.

However, a common challenge in tokenization is dealing with OOV words—words that are not present in the model's vocabulary. Handling these words is essential for maintaining the performance and accuracy of language models. Additionally, text cleaning before tokenization—such as removing unnecessary symbols, handling case sensitivity, and ensuring proper encoding—can significantly improve tokenization quality. Another important aspect is selecting the right model for the language, as some tokenizers are better suited for multilingual text.
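For instance, a minimal cleaning pass might look like the sketch below (the clean_text helper and the sample string are illustrative; the right steps depend on your data and your tokenizer):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode so visually identical characters share one encoding
    text = unicodedata.normalize("NFKC", text)
    # Lowercase if the downstream tokenizer is uncased
    text = text.lower()
    # Collapse repeated whitespace and strip leading/trailing spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Héllo\u00A0  WORLD!!  "))  # -> "héllo world!!"
```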

How Tokenizers Handle OOV Words

Different tokenization methods handle OOV words in distinct ways:

| Tokenizer Type | Method | OOV Handling Strategy |
| --- | --- | --- |
| WordPiece (BERT) | Subword tokenization | Uses [UNK] if no match is found |
| Byte-Pair Encoding (GPT-2, RoBERTa) | Merges frequent character pairs | Breaks OOV words into smaller subwords |
| SentencePiece (T5, mT5, XLM-R) | Probabilistic model-based | Keeps rare words but splits them into known subwords |

Tokenization with BERT, GPT-2, and T5

Let's explore how different tokenization methods handle a complex text containing Korean words, emojis, and links. The text we will use is:
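The snippets below assume an illustrative sample along these lines (any string mixing English, Korean, an emoji, and a URL works equally well):

```python
text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
```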

1. WordPiece Tokenization with BERT
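A minimal sketch using the Hugging Face Transformers library, assuming the bert-base-uncased checkpoint and the sample text defined above:

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (uncased English vocabulary)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
tokens = tokenizer.tokenize(text)
print(tokens)
# The Korean characters and the emoji are not in this vocabulary,
# so they typically come back as [UNK] tokens.
```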

Output:

BERT Tokenization Output Explanation:

  • Breaks words into subwords, using ## to mark continuation (non-initial) pieces.
  • Uses [UNK] for unknown tokens (e.g., emojis, non-Latin scripts like Korean).
  • If working with multilingual text, using bert-base-multilingual-cased instead of bert-base-uncased can significantly improve tokenization accuracy for non-English languages.
2. Byte Pair Encoding (BPE) with GPT-2
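The same idea with GPT-2's byte-level BPE tokenizer (the gpt2 checkpoint and the same sample text are assumed):

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE, so every possible input byte maps to some token
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
tokens = tokenizer.tokenize(text)
print(tokens)
# No [UNK] tokens appear: the emoji, Korean text, and URL are all
# decomposed into byte-level subword pieces.
```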

Output:

GPT-2 Tokenization Output Explanation:

  • No [UNK] tokens: GPT-2's byte-level BPE falls back to smaller subwords or individual bytes, so every input maps to known tokens.
  • Handles emojis, Korean text, and URLs more effectively than WordPiece, but still not optimized for non-English languages.
3. SentencePiece Tokenization with T5
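A comparable sketch with T5's SentencePiece tokenizer (the t5-small checkpoint is assumed; it requires the sentencepiece package):

```python
from transformers import AutoTokenizer

# T5 relies on a SentencePiece unigram model
tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
tokens = tokenizer.tokenize(text)
print(tokens)
# Tokens beginning with ▁ mark the start of a new word
# (a preceding space in the original text).
```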

Output:

T5 Tokenization Output Explanation:

  • Uses SentencePiece, a flexible tokenization approach that supports diverse characters.
  • Adds ▁ markers to indicate the start of a new word (a preceding space in the original text).
  • Handles non-English text more effectively compared to WordPiece and BPE.
Comparison of WordPiece, BPE, and SentencePiece Tokenization

| Feature | BERT (WordPiece) | GPT-2 (BPE) | T5 (SentencePiece) |
| --- | --- | --- | --- |
| Handles OOV words | Replaces with [UNK] | Breaks into subwords | Splits into subwords without [UNK] |
| Emoji Support | [UNK] | Keeps intact | Keeps intact |
| Non-Latin Text (e.g., Korean) | [UNK] | Splits into known subwords | Keeps as a whole word |
| Number Handling | Keeps whole | Splits into sub-tokens | Splits into sub-tokens |
| Hyphenated Words | Sometimes splits | Often keeps intact | Splits smartly |

How to Improve OOV Handling?

To reduce OOV issues, you can:

  • Use Multilingual Tokenizers (xlm-roberta-base, bert-base-multilingual-cased)
  • Train a Custom Tokenizer (e.g., sentencepiece, BPE) on domain-specific text (see the sketch after this list)
  • Expand Vocabulary by pretraining on larger datasets
  • Ensure Proper Text Cleaning before tokenization (e.g., removing unnecessary symbols, handling casing, ensuring correct encoding)
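As an illustration of the second option, here is a minimal sketch that trains a BPE tokenizer with the Hugging Face tokenizers library; domain_corpus.txt is a hypothetical file of domain-specific text:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build an empty BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn merges from domain-specific text (hypothetical corpus file)
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["domain_corpus.txt"], trainer=trainer)

# Terms that appear often in the corpus should now split into fewer pieces
print(tokenizer.encode("myocardial infarction").tokens)
```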
Example: Handling Chinese Text with XLM-RoBERTa
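A sketch of the idea, assuming the xlm-roberta-base checkpoint and an illustrative Chinese sentence:

```python
from transformers import AutoTokenizer

# XLM-RoBERTa's SentencePiece vocabulary is trained to cover ~100 languages
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

chinese_text = "我喜欢自然语言处理"  # "I like natural language processing"
tokens = tokenizer.tokenize(chinese_text)
print(tokens)
# The Chinese characters are kept as known subword pieces
# rather than being replaced with an unknown token.
```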

Output:

Summary and Next Steps

In this lesson, we explored different tokenization methods and their strategies for handling OOV words. We compared WordPiece, BPE, and SentencePiece tokenizers and discussed how to improve OOV handling. As a next step, practice implementing these tokenization techniques on various text samples, including multilingual data, to better understand their strengths and limitations.
