Introduction to Tokenization and OOV Handling

Welcome to this lesson on tokenization and handling Out-of-Vocabulary (OOV) words. Tokenization is a fundamental step in Natural Language Processing (NLP) that involves breaking down text into smaller units called tokens. This process is crucial for AI and Large Language Models (LLMs) as it allows them to understand and process text data effectively.

However, a common challenge in tokenization is dealing with OOV words—words that are not present in the model's vocabulary. Handling these words is essential for maintaining the performance and accuracy of language models. Additionally, text cleaning before tokenization—such as removing unnecessary symbols, handling case sensitivity, and ensuring proper encoding—can significantly improve tokenization quality. Another important aspect is selecting the right model for the language, as some tokenizers are better suited for multilingual text.
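For instance, a minimal cleaning pass might look like the sketch below (the clean_text helper and the sample string are illustrative; the right steps depend on your data and your tokenizer):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode so visually identical characters share one encoding
    text = unicodedata.normalize("NFKC", text)
    # Lowercase if the downstream tokenizer is uncased
    text = text.lower()
    # Collapse repeated whitespace and strip leading/trailing spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Héllo\u00A0  WORLD!!  "))  # -> "héllo world!!"
```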

How Tokenizers Handle OOV Words

Different tokenization methods handle OOV words in distinct ways:

| Tokenizer Type | Method | OOV Handling Strategy |
| --- | --- | --- |
| WordPiece (BERT) | Subword tokenization | Uses [UNK] if no match is found |
| Byte-Pair Encoding (GPT-2, RoBERTa) | Merges frequent character pairs | Breaks OOV words into smaller subwords |
| SentencePiece (T5, mT5, XLM-R) | Probabilistic model-based | Keeps rare words but splits them into known subwords |

Tokenization with BERT, GPT-2, and T5

Let's explore how different tokenization methods handle a complex text containing Korean words, emojis, and links. The text we will use is:
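The snippets below assume an illustrative sample along these lines (any string mixing English, Korean, an emoji, and a URL works equally well):

```python
text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
```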

1. WordPiece Tokenization with BERT
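A minimal sketch using the Hugging Face Transformers library, assuming the bert-base-uncased checkpoint and the sample text defined above:

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (uncased English vocabulary)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
tokens = tokenizer.tokenize(text)
print(tokens)
# The Korean characters and the emoji are not in this vocabulary,
# so they typically come back as [UNK] tokens.
```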

Output:

BERT Tokenization Output Explanation:

  • Breaks words into subwords, using ## to mark continuation (non-initial) pieces.
  • Uses [UNK] for unknown tokens (e.g., emojis, non-Latin scripts like Korean).
  • If working with multilingual text, using bert-base-multilingual-cased instead of bert-base-uncased can significantly improve tokenization accuracy for non-English languages.
2. Byte Pair Encoding (BPE) with GPT-2
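The same idea with GPT-2's byte-level BPE tokenizer (the gpt2 checkpoint and the same sample text are assumed):

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE, so every possible input byte maps to some token
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
tokens = tokenizer.tokenize(text)
print(tokens)
# No [UNK] tokens appear: the emoji, Korean text, and URL are all
# decomposed into byte-level subword pieces.
```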

Output:

GPT-2 Tokenization Output Explanation:

  • No [UNK] tokens: GPT-2's byte-level BPE falls back to smaller subwords or individual bytes, so every input maps to known tokens.
  • Handles emojis, Korean text, and URLs more effectively than WordPiece, but still not optimized for non-English languages.
3. SentencePiece Tokenization with T5
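A comparable sketch with T5's SentencePiece tokenizer (the t5-small checkpoint is assumed; it requires the sentencepiece package):

```python
from transformers import AutoTokenizer

# T5 relies on a SentencePiece unigram model
tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "I love NLP! 자연어 처리 is fun 🤖 https://example.com"
tokens = tokenizer.tokenize(text)
print(tokens)
# Tokens beginning with ▁ mark the start of a new word
# (a preceding space in the original text).
```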

Output:

T5 Tokenization Output Explanation:

  • Uses SentencePiece, a flexible tokenization approach that supports diverse characters.
  • Adds ▁ markers to indicate the start of a new word (a preceding space in the original text).
  • Handles non-English text more effectively compared to WordPiece and BPE.
Comparison of WordPiece, BPE, and SentencePiece Tokenization

| Feature | BERT (WordPiece) | GPT-2 (BPE) | T5 (SentencePiece) |
| --- | --- | --- | --- |
| Handles OOV words | Replaces with [UNK] | Breaks into subwords | Splits into subwords without [UNK] |
| Emoji Support | [UNK] | Keeps intact | Keeps intact |
| Non-Latin Text (e.g., Korean) | [UNK] | Splits into known subwords | Keeps as a whole word |
| Number Handling | Keeps whole | Splits into sub-tokens | Splits into sub-tokens |
| Hyphenated Words | Sometimes splits | Often keeps intact | Splits smartly |

How to Improve OOV Handling?

To reduce OOV issues, you can:

  • Use Multilingual Tokenizers (xlm-roberta-base, bert-base-multilingual-cased)
  • Train a Custom Tokenizer (e.g., sentencepiece, BPE) on domain-specific text (see the sketch after this list)
  • Expand Vocabulary by pretraining on larger datasets
  • Ensure Proper Text Cleaning before tokenization (e.g., removing unnecessary symbols, handling casing, ensuring correct encoding)
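As an illustration of the second option, here is a minimal sketch that trains a BPE tokenizer with the Hugging Face tokenizers library; domain_corpus.txt is a hypothetical file of domain-specific text:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build an empty BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn merges from domain-specific text (hypothetical corpus file)
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["domain_corpus.txt"], trainer=trainer)

# Terms that appear often in the corpus should now split into fewer pieces
print(tokenizer.encode("myocardial infarction").tokens)
```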
Example: Handling Chinese Text with XLM-RoBERTa
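A sketch of the idea, assuming the xlm-roberta-base checkpoint and an illustrative Chinese sentence:

```python
from transformers import AutoTokenizer

# XLM-RoBERTa's SentencePiece vocabulary is trained to cover ~100 languages
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

chinese_text = "我喜欢自然语言处理"  # "I like natural language processing"
tokens = tokenizer.tokenize(chinese_text)
print(tokens)
# The Chinese characters are kept as known subword pieces
# rather than being replaced with an unknown token.
```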

Output:

Summary and Next Steps

In this lesson, we explored different tokenization methods and their strategies for handling OOV words. We compared WordPiece, BPE, and SentencePiece tokenizers and discussed how to improve OOV handling. As a next step, practice implementing these tokenization techniques on various text samples, including multilingual data, to better understand their strengths and limitations.
