Welcome to this lesson on tokenization and handling Out-of-Vocabulary (OOV) words. Tokenization is a fundamental step in Natural Language Processing (NLP) that involves breaking down text into smaller units called tokens. This process is crucial for AI and Large Language Models (LLMs) as it allows them to understand and process text data effectively.
However, a common challenge in tokenization is dealing with OOV words—words that are not present in the model's vocabulary. Handling these words is essential for maintaining the performance and accuracy of language models. Additionally, text cleaning before tokenization—such as removing unnecessary symbols, handling case sensitivity, and ensuring proper encoding—can significantly improve tokenization quality. Another important aspect is selecting the right model for the language, as some tokenizers are better suited for multilingual text.
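As a minimal illustration of such cleaning, here is a small sketch (the exact steps depend on your data; this hypothetical helper only normalizes Unicode, lowercases, and collapses whitespace):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode so visually identical characters share one encoding
    text = unicodedata.normalize("NFKC", text)
    # Lowercase, which matters for case-insensitive models such as bert-base-uncased
    text = text.lower()
    # Collapse repeated whitespace and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("  Héllo\u00A0WORLD!!  "))
```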
Different tokenization methods handle OOV words in distinct ways: WordPiece (used by BERT), Byte-Pair Encoding (used by GPT-2), and SentencePiece (used by T5) each follow their own strategy, as we will see below.
Let's explore how different tokenization methods handle a complex piece of text containing Korean words, emojis, and links.
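The sample string below is assumed for illustration; any text mixing English, Korean, an emoji, and a URL would behave similarly:

```python
# Hypothetical sample text mixing English, Korean, an emoji, and a URL
text = "NLP is fun! 자연어 처리 😊 https://example.com"
```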
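First, a minimal sketch of WordPiece tokenization with BERT, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint:

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (assumed checkpoint: bert-base-uncased)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# `text` is the sample string defined above
print(bert_tokenizer.tokenize(text))
```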
BERT Tokenization Output Explanation:
- Breaks words into subwords, using `##` to mark subword units.
- Uses `[UNK]` for unknown tokens (e.g., emojis, non-Latin scripts like Korean).
- If working with multilingual text, using `bert-base-multilingual-cased` instead of `bert-base-uncased` can significantly improve tokenization accuracy for non-English languages (see the sketch after this list).
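For multilingual input, the same sketch can simply swap in the multilingual checkpoint:

```python
from transformers import AutoTokenizer

# Assumed checkpoint: bert-base-multilingual-cased, which covers 100+ languages
multilingual_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Korean characters are now split into subwords instead of collapsing to [UNK]
print(multilingual_tokenizer.tokenize(text))
```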
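Next, a sketch of GPT-2's byte-level BPE tokenization (assumed checkpoint: `gpt2`):

```python
from transformers import AutoTokenizer

# Load GPT-2's byte-level BPE tokenizer (assumed checkpoint: gpt2)
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Byte-level BPE falls back to byte pieces, so no [UNK] tokens appear
print(gpt2_tokenizer.tokenize(text))
```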
GPT-2 Tokenization Output Explanation:
- No `[UNK]` tokens, as BPE splits unknown words into frequent subword pairs.
- Handles emojis, Korean text, and URLs more effectively than WordPiece, but is still not optimized for non-English languages.
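Finally, a sketch of T5's SentencePiece tokenization (assumed checkpoint: `t5-small`; the `sentencepiece` package must be installed):

```python
from transformers import AutoTokenizer

# Load T5's SentencePiece tokenizer (assumed checkpoint: t5-small)
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")

# SentencePiece marks the start of each new word with the "▁" symbol
print(t5_tokenizer.tokenize(text))
```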
T5 Tokenization Output Explanation:
- Uses SentencePiece, a flexible tokenization approach that supports diverse characters.
- Adds ▁ markers to indicate new words.
- Handles non-English text more effectively than WordPiece and BPE.
To reduce OOV issues, you can:
- Use Multilingual Tokenizers (`xlm-roberta-base`, `bert-base-multilingual-cased`)
- Train a Custom Tokenizer (e.g., `sentencepiece`, BPE) on domain-specific text (see the sketch after this list)
- Expand Vocabulary by pretraining on larger datasets
- Ensure Proper Text Cleaning before tokenization (e.g., removing unnecessary symbols, handling casing, ensuring correct encoding)
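As one example of the custom-tokenizer route, here is a minimal sketch using the `sentencepiece` library; the corpus file name, vocabulary size, and sample sentence are assumed values for illustration:

```python
import sentencepiece as spm

# Train a SentencePiece BPE model on a domain-specific corpus
# (hypothetical file name and vocabulary size)
spm.SentencePieceTrainer.train(
    input="domain_corpus.txt",
    model_prefix="domain_sp",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained model and tokenize a sample sentence
sp = spm.SentencePieceProcessor(model_file="domain_sp.model")
print(sp.encode("A domain-specific sentence goes here.", out_type=str))
```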
In this lesson, we explored different tokenization methods and their strategies for handling OOV words. We compared WordPiece, BPE, and SentencePiece tokenizers and discussed how to improve OOV handling. As a next step, practice implementing these tokenization techniques on various text samples, including multilingual data, to better understand their strengths and limitations.
