Welcome to the very first lesson of our course "Text Representation Techniques for RAG Systems", part of the "Foundations of RAG Systems" learning path! In the first course of this path, you learned the fundamentals of RAG, how to structure a simple RAG workflow, and why combining retrieval with generation is so powerful. Now, we'll shift our focus to how we can turn raw text into numerical data, a crucial step if we want our RAG systems to retrieve information accurately and feed it into downstream pipelines. In other words, we'll focus on the indexing component of our RAG pipeline.
In this lesson, our main objectives are:
- Understand why we must transform text into a structured format for RAG workflows.
- Explore the Bag-of-Words (BOW) method, a simple yet classic text representation technique.
By the end, you'll know how words get mapped into vectors and why these representations matter when building robust retrieval systems.
RAG systems revolve around retrieving relevant documents based on a user’s query, then generating a final answer. However, computers don’t process language the way humans do; they require structured or numerical forms of text to effectively compare one document with another. Without a proper representation of text, two main issues arise:
- We can’t reliably measure how similar one piece of text is to another.
- It becomes far more difficult to retrieve accurate, contextually relevant information.
A straightforward solution to this challenge is the Bag-of-Words method. It works by counting how often each word appears, providing a simple numerical snapshot of a document. While this approach ignores the order of words and misses linguistic nuances, it’s an excellent entry point for understanding how to convert messy human language into machine-friendly formats that form the core of RAG systems.
Let’s explore how Bag-of-Words vectors capture word frequency without considering word order. Consider these three sentences:
- “I love machine learning”
- “Machine learning is fun”
- “I love coding”
To construct our BOW representation, we first gather all unique words to form our vocabulary: {I, love, machine, learning, is, fun, coding}. Each word in the vocabulary maps to an index:
- I → 0
- love → 1
- machine → 2
- learning → 3
- is → 4
- fun → 5
- coding → 6
With this vocabulary, we can transform each sentence into a numeric vector by counting the occurrences of each word. For example:
- “I love machine learning” → [1, 1, 1, 1, 0, 0, 0]
- “Machine learning is fun” → [0, 0, 1, 1, 1, 1, 0]
- “I love coding” → [1, 1, 0, 0, 0, 0, 1]
In these vectors, each position corresponds to a word in the vocabulary, and the numbers indicate how often that word appears in the sentence. This frequency-based representation is a straightforward way to convert text into numerical form for retrieval tasks.
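If you want to verify these counts yourself, Python's built-in `collections.Counter` reproduces the same per-sentence frequencies. A quick illustrative check:

```python
from collections import Counter

# Count word occurrences in one of the example sentences
sentence = "Machine learning is fun"
print(Counter(sentence.lower().split()))
# Counter({'machine': 1, 'learning': 1, 'is': 1, 'fun': 1})
```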
In a BOW approach, the first step is constructing a dictionary (often called a “vocabulary”) that assigns each distinct word an integer index. Here's how you might do it using a function that takes as input a collection of documents:
```python
def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            # Strip punctuation around words (simple approach)
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    # Sort the words to have a consistent, deterministic order
    return {word: idx for idx, word in enumerate(sorted(unique_words))}
```
Let's see what's happening:
- We iterate through each text in `docs`.
- We convert the text to lowercase and split it into tokens (words).
- We remove punctuation at the start and end of these tokens.
- Each cleaned token is added to a `unique_words` set so we only keep distinct words.
- Finally, we sort the collection of unique words to ensure a consistent order and assign each word an index stored in a dictionary.
This vocabulary allows us to look up a word and pinpoint exactly where it should appear in any BOW vector.
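For instance, running `build_vocab` on our three example sentences produces the following mapping. Note that because the function sorts the words, the indices differ from the appearance-order illustration above; any consistent ordering works, as long as the same vocabulary is used for every vector:

```python
docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]

vocab = build_vocab(docs)
print(vocab)
# {'coding': 0, 'fun': 1, 'i': 2, 'is': 3, 'learning': 4, 'love': 5, 'machine': 6}
```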
Once you have a vocabulary, you can create a numeric BOW vector for each text. Every position in the vector corresponds to one word in the vocabulary, and the vector entries indicate how many times each vocabulary word appears in the text. Let's illustrate with code:
```python
import numpy as np

def bow_vectorize(text, vocab):
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            # Increment the vector slot corresponding to this word
            vector[vocab[clean_word]] += 1
    return vector
```
- We create a NumPy array of zeros, with length equal to the number of unique words in our vocabulary.
- For each cleaned token in our text, we look up its vocabulary index.
- We increment the vector at that index, effectively counting the occurrences of each vocabulary word.
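Putting both functions together on the example sentences yields one count vector per document (positions follow the sorted vocabulary from `build_vocab`):

```python
docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]
vocab = build_vocab(docs)

for doc in docs:
    print(doc, "->", bow_vectorize(doc, vocab))
# I love machine learning -> [0 0 1 0 1 1 1]
# Machine learning is fun -> [0 1 0 1 1 0 1]
# I love coding -> [1 0 1 0 0 1 0]
```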
Even though BOW ignores word order and deeper linguistic information, it remains a valuable technique for smaller-scale tasks or as a baseline representation. It demonstrates the mechanics of transforming words into arrays of numerical counts, which is essential for retrieval tasks.
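To see how these count vectors support retrieval, here is a minimal sketch that ranks the example sentences against a query. It reuses `build_vocab` and `bow_vectorize` from above and adds a simple cosine-similarity helper, one common choice for comparing count vectors (the helper itself is our own illustrative addition):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalized by vector lengths; 0.0 if either vector is all zeros
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm else 0.0

docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]
vocab = build_vocab(docs)
doc_vectors = [bow_vectorize(doc, vocab) for doc in docs]

# Vectorize the query with the same vocabulary, then rank documents
query_vector = bow_vectorize("fun machine learning", vocab)
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
print(docs[int(np.argmax(scores))])  # "Machine learning is fun"
```

Because the query shares three words with the second sentence and at most two with the others, cosine similarity correctly ranks it first.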
In this lesson, you learned why text must be represented numerically for RAG systems and walked through a foundational technique — Bag-of-Words — to convert words into count-based vectors. While BOW has limitations in capturing context, it's an excellent first step in any NLP workflow. Practice building vocabularies and generating BOW vectors with your own text to get comfortable with these concepts.
Up next, we'll explore more advanced methods that preserve the meaning and context of your text, such as embeddings derived from language models. In the practice exercises that follow, you'll get hands-on experience coding your own BOW pipeline.
This foundation will set the stage for more powerful semantic retrieval techniques and deeper integration with RAG pipelines later in the course. Good luck, and have fun experimenting!
