Welcome to our final lesson in this course about Text Representation Techniques for RAG systems! You’ve already explored the basics of Bag-of-Words (BOW) representations and experimented with sentence embeddings in earlier lessons. Now, we’re going to compare how these two methods differ in actual search scenarios. Think of this as a practical refresher on BOW and embeddings, but with an added focus on side-by-side comparison and deciding which approach might be best for different retrieval use cases.
Before diving into the code, let’s clarify why both methods — from straightforward word matching to deeper semantic modeling — are valuable.
- Lexical Overlap (BOW): This approach checks for exact word matches, making it easy to interpret how documents are scored. If your query has the phrase "external data," any document containing those exact words gets a higher score. It’s simple, transparent, and efficient for many tasks. But BOW can struggle with synonyms or varying phrasing.
- Semantic Similarity (Embeddings): Here, we focus on the overall meaning rather than specific words. Two differently phrased sentences can still be close in the embedding space if they convey the same idea. This approach excels at capturing nuances. However, it depends on a trained model and requires more computation.
In some real-world settings, you might even combine both: run a quick lexical match and then refine the results with a more precise semantic model. Let’s see how these methods look in code so you can start comparing results for yourself.
Below is an example of how to implement a BOW-based search workflow. We first build a vocabulary, then vectorize each document and the query according to how often each word appears.
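Here is a minimal sketch of what that workflow could look like; the exact tokenization and cleanup rules (lowercasing, stripping punctuation) are illustrative choices rather than the only way to do it:

```python
import re
from collections import Counter

def bow_vectorize(text, vocabulary):
    """Turn text into a Bag-of-Words count vector aligned with the vocabulary."""
    # Lowercase, strip punctuation, and split on whitespace
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    counts = Counter(tokens)
    # One slot per vocabulary word, holding how often it appears in this text
    return [counts[word] for word in vocabulary]

def bow_search(query, documents):
    """Rank documents by the number of shared token occurrences with the query."""
    # Build the shared vocabulary from the query and every document
    all_tokens = []
    for text in documents + [query]:
        all_tokens.extend(re.sub(r"[^\w\s]", " ", text.lower()).split())
    vocabulary = sorted(set(all_tokens))

    query_vec = bow_vectorize(query, vocabulary)
    results = []
    for idx, doc in enumerate(documents):
        doc_vec = bow_vectorize(doc, vocabulary)
        # Dot product: overlapping terms contribute their counts to the score
        score = sum(q * d for q, d in zip(query_vec, doc_vec))
        results.append((idx, score))
    # Highest-scoring (most lexically similar) documents first
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```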
Let's break this down:
- `bow_vectorize`: Splits the text into words, applies some light cleanup (punctuation removal), and counts occurrences. If “external” appears once in the query, that contributes 1 to the corresponding position in the query vector (see the quick example after this list).
- `bow_search`: Converts the query into a BOW vector, does the same for each document, and uses the dot product to measure shared token counts. Documents with many overlapping terms move to the top of the list.
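For example, calling the sketched `bow_vectorize` with a tiny vocabulary behaves like this:

```python
vocab = ["data", "external", "retrieval"]
print(bow_vectorize("External data, external sources.", vocab))
# -> [1, 2, 0]: "data" once, "external" twice, "retrieval" never
```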
This method is straightforward and fast in situations where exact word usage is critical. But what if your query is phrased differently from the document’s text? That’s where embeddings shine.
To tackle the challenge of phrasing differences or synonyms, let’s look at embedding-based search:
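Here is a minimal sketch of that workflow; it assumes a sentence-transformers model, and the specific model name below is just an illustrative choice:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model can be used; this one is an illustrative choice
model = SentenceTransformer("all-MiniLM-L6-v2")

def cos_sim(vec_a, vec_b):
    """Cosine similarity: close to 1 when two vectors point in nearly the same direction."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

def embedding_search(query, documents):
    """Rank documents by how semantically close their embeddings are to the query embedding."""
    query_vec = model.encode(query)
    doc_vecs = model.encode(documents)
    results = [(idx, cos_sim(query_vec, doc_vec)) for idx, doc_vec in enumerate(doc_vecs)]
    # Most semantically similar documents first
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```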
In this code snippet:
- `cos_sim`: Implements a cosine similarity function to measure how closely two vectors align. Remember, if they point in a similar direction in embedding space, the cosine similarity value is higher (see the quick check after this list).
- `embedding_search`: Converts the query and each document into embedding vectors using the model, then uses cosine similarity to rank how “close” the document is to the query’s meaning.
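As a quick sanity check of `cos_sim`, two vectors pointing in nearly the same direction score close to 1:

```python
a = np.array([1.0, 0.0])
b = np.array([0.9, 0.1])
print(cos_sim(a, b))  # roughly 0.99, since the vectors nearly coincide
```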
Here, the retrieval process depends more on interpretive meaning than precise word matching. That means a query about “combining external data with generative models” can find documents discussing “merging external text into RAG systems,” even if some words differ.
Finally, let's consider the sample query "How does a system combine external data with language generation to improve responses?" and discuss the corresponding search results for both Bag-of-Words (BOW) and embedding-based methods:
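Assuming `documents` is the list of sample texts used in this lesson (indexed from 0), the comparison could be run like this:

```python
query = ("How does a system combine external data with language "
         "generation to improve responses?")

# `documents` is assumed to be the lesson's list of sample texts
bow_ranking = bow_search(query, documents)              # [(doc_index, overlap_score), ...]
embedding_ranking = embedding_search(query, documents)  # [(doc_index, cosine_similarity), ...]

print("BOW ranking:      ", bow_ranking)
print("Embedding ranking:", embedding_ranking)
```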
When using BOW, notice that the ranking primarily hinges on exact keyword matches. For instance, Document 3 is placed at the top solely because the text explicitly contains the words “combine” and “external,” even though Document 0 is arguably more relevant to the query’s intent regarding language models and RAG. Meanwhile, the embedding-based approach ranks Document 0 higher because it captures the semantic relationship between “language generation” and “integrating relevant external documents,” even though some of the words do not match exactly.
Embeddings allow for more flexibility and deeper understanding, correctly elevating items like Document 0 and Document 1, which are more relevant to the query’s goal, above Document 3. This further illustrates how embeddings can bridge the gap when the query and the documents use different but related terms.
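As mentioned near the start of the lesson, you can also combine the two: use BOW to shortlist candidates cheaply, then re-rank that shortlist with embeddings. Here is a minimal sketch reusing the functions above (the `shortlist_size` value is just an illustrative choice):

```python
def hybrid_search(query, documents, shortlist_size=3):
    """Shortlist documents with cheap BOW matching, then re-rank them with embeddings."""
    # Step 1: quick lexical pass keeps only the top candidates
    shortlist = [idx for idx, _ in bow_search(query, documents)[:shortlist_size]]
    # Step 2: run the slower embedding model on the shortlisted documents only
    reranked = embedding_search(query, [documents[idx] for idx in shortlist])
    # Map local indices back to the original document indices
    return [(shortlist[local_idx], score) for local_idx, score in reranked]
```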
In this lesson, we compared Bag-of-Words search with embedding-based semantic search and saw how each method ranks documents differently. BOW is fast and transparent for quick, vocabulary-based matches, while embeddings capture deeper connections between words and phrases.
Next, you’ll get hands-on practice implementing these approaches. Have fun!
