Welcome to the very first lesson of our course "Text Representation Techniques for RAG Systems", part of the "Foundations of RAG Systems" learning path! In the first course of this path, you learned the fundamentals of RAG, how to structure a simple RAG workflow, and why combining retrieval with generation is so powerful. Now, we'll shift our focus to how we can turn raw text into numerical data, a crucial step if we want our RAG systems to retrieve information accurately and feed it into downstream pipelines. In other words, we'll focus on the indexing component of our RAG pipeline.
In this lesson, our main objectives are:
- Understand why we must transform text into a structured format for RAG workflows.
- Explore the Bag-of-Words (BOW) method, a simple yet classic text representation technique.
By the end, you'll know how words get mapped into vectors and why these representations matter when building robust retrieval systems.
RAG systems revolve around retrieving relevant documents based on a user’s query, then generating a final answer. However, computers don’t process language the way humans do; they require structured or numerical forms of text to effectively compare one document with another. Without a proper representation of text, two main issues arise:
- We can’t reliably measure how similar one piece of text is to another.
- It becomes far more difficult to retrieve accurate, contextually relevant information.
A straightforward solution to this challenge is the Bag-of-Words method. It works by counting how often each word appears, providing a simple numerical snapshot of a document. While this approach ignores the order of words and misses linguistic nuances, it’s an excellent entry point for understanding how to convert messy human language into machine-friendly formats that form the core of RAG systems.
Let’s explore how Bag-of-Words vectors capture word frequency without considering word order. Consider these three sentences:
- “I love machine learning”
- “Machine learning is fun”
- “I love coding”
To construct our BOW representation, we first gather all unique words to form our vocabulary: {I, love, machine, learning, is, fun, coding}. Each word in the vocabulary maps to an index:
- I → 0
- love → 1
- machine → 2
- learning → 3
- is → 4
- fun → 5
- coding → 6
With this vocabulary, we can transform each sentence into a numeric vector by counting the occurrences of each word. For example:
- “I love machine learning” → [1, 1, 1, 1, 0, 0, 0]
- “Machine learning is fun” → [0, 0, 1, 1, 1, 1, 0]
- “I love coding” → [1, 1, 0, 0, 0, 0, 1]
In these vectors, each position corresponds to a word in the vocabulary, and the numbers indicate how often that word appears in the sentence. This frequency-based representation is a straightforward way to convert text into numerical form for retrieval tasks.
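If you want to verify these counts yourself, Python's built-in `collections.Counter` reproduces the same per-sentence frequencies. A quick illustrative check:

```python
from collections import Counter

# Count word occurrences in one of the example sentences
sentence = "Machine learning is fun"
print(Counter(sentence.lower().split()))
# Counter({'machine': 1, 'learning': 1, 'is': 1, 'fun': 1})
```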
In a BOW approach, the first step is constructing a dictionary (often called a “vocabulary”) that assigns each distinct word an integer index. Here's how you might do it using a function that takes as input a collection of documents:
```python
def build_vocab(docs):
    unique_words = set()
    for doc in docs:
        for word in doc.lower().split():
            # Strip punctuation around words (simple approach)
            clean_word = word.strip(".,!?")
            if clean_word:
                unique_words.add(clean_word)
    # Sort the words to have a consistent, deterministic order
    return {word: idx for idx, word in enumerate(sorted(unique_words))}
```
Let's see what's happening:
- We iterate through each text in `docs`.
- We convert the text to lowercase and split it into tokens (words).
- We remove punctuation at the start and end of these tokens.
- Each cleaned token is added to a `unique_words` set so we only keep distinct words.
- Finally, we sort the collection of unique words to ensure a consistent order and assign each word an index stored in a dictionary.
This vocabulary allows us to look up a word and pinpoint exactly where it should appear in any BOW vector.
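For instance, running `build_vocab` on our three example sentences produces the following mapping. Note that because the function sorts the words, the indices differ from the appearance-order illustration above; any consistent ordering works, as long as the same vocabulary is used for every vector:

```python
docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]

vocab = build_vocab(docs)
print(vocab)
# {'coding': 0, 'fun': 1, 'i': 2, 'is': 3, 'learning': 4, 'love': 5, 'machine': 6}
```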
Once you have a vocabulary, you can create a numeric BOW vector for each text. Every position in the vector corresponds to one word in the vocabulary, and the vector entries indicate how many times each vocabulary word appears in the text. Let's illustrate with code:
```python
import numpy as np

def bow_vectorize(text, vocab):
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            # Increment the vector slot corresponding to this word
            vector[vocab[clean_word]] += 1
    return vector
```
- We create a NumPy array of zeros, with length equal to the number of unique words in our vocabulary.
- For each cleaned token in our text, we look up its vocabulary index.
- We increment the vector at that index, effectively counting the occurrences of each vocabulary word.
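Putting both functions together on the example sentences yields one count vector per document (positions follow the sorted vocabulary from `build_vocab`):

```python
docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]
vocab = build_vocab(docs)

for doc in docs:
    print(doc, "->", bow_vectorize(doc, vocab))
# I love machine learning -> [0 0 1 0 1 1 1]
# Machine learning is fun -> [0 1 0 1 1 0 1]
# I love coding -> [1 0 1 0 0 1 0]
```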
Even though BOW ignores word order and deeper linguistic information, it remains a valuable technique for smaller-scale tasks or as a baseline representation. It demonstrates the mechanics of transforming words into arrays of numerical counts, which is essential for retrieval tasks.
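To see how these count vectors support retrieval, here is a minimal sketch that ranks the example sentences against a query. It reuses `build_vocab` and `bow_vectorize` from above and adds a simple cosine-similarity helper, one common choice for comparing count vectors (the helper itself is our own illustrative addition):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalized by vector lengths; 0.0 if either vector is all zeros
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm else 0.0

docs = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
]
vocab = build_vocab(docs)
doc_vectors = [bow_vectorize(doc, vocab) for doc in docs]

# Vectorize the query with the same vocabulary, then rank documents
query_vector = bow_vectorize("fun machine learning", vocab)
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
print(docs[int(np.argmax(scores))])  # "Machine learning is fun"
```

Because the query shares three words with the second sentence and at most two with the others, cosine similarity correctly ranks it first.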
In this lesson, you learned why text must be represented numerically for RAG systems and walked through a foundational technique — Bag-of-Words — to convert words into count-based vectors. While BOW has limitations in capturing context, it's an excellent first step in any NLP workflow. Practice building vocabularies and generating BOW vectors with your own text to get comfortable with these concepts.
Up next, we'll explore more advanced methods that preserve the meaning and context of your text, such as embeddings derived from language models. In the practice exercises that follow, you'll get hands-on experience coding your own BOW pipeline.
This foundation will set the stage for more powerful semantic retrieval techniques and deeper integration with RAG pipelines later in the course. Good luck, and have fun experimenting!
