Introduction

Welcome to our final lesson in this course about Text Representation Techniques for RAG systems! You’ve already explored the basics of Bag-of-Words (BOW) representations and experimented with sentence embeddings in earlier lessons. Now, we’re going to compare how these two methods differ in actual search scenarios. Think of this as a practical refresher on BOW and embeddings, but with an added focus on side-by-side comparison and deciding which approach might be best for different retrieval use cases.

From Words To Meaning: Why We Need Both Approaches

Before diving into the code, let's clarify why both methods are valuable: straightforward word matching and deeper semantic modeling each bring something different to retrieval.

  • Lexical Overlap (BOW): This approach checks for exact word matches, making it easy to interpret how documents are scored. If your query has the phrase "external data," any document containing those exact words gets a higher score. It’s simple, transparent, and efficient for many tasks. But BOW can struggle with synonyms or varying phrasing.

  • Semantic Similarity (Embeddings): Here, we focus on the overall meaning rather than specific words. Two differently phrased sentences can still be close in the embedding space if they convey the same idea. This approach excels at capturing nuances. However, it depends on a trained model and requires more computation.

In some real-world settings, you might even combine both: run a quick lexical match to shortlist candidates, then refine the results with a more precise semantic model (we'll sketch that combination after analyzing the results). Let's see how these methods look in code so you can start comparing results for yourself.

Implementing Bag-of-Words Search

Below is an example of how to implement a BOW-based search workflow. We first build a vocabulary that maps each word to a vector index, then vectorize each document and the query by counting how often each vocabulary word appears.

Python
import numpy as np

def bow_vectorize(text, vocab):
    """
    Convert a text into a Bag-of-Words vector by counting how many times
    each token from our vocabulary appears in the text.
    """
    vector = np.zeros(len(vocab), dtype=int)
    for word in text.lower().split():
        # Remove punctuation for consistency
        clean_word = word.strip(".,!?")
        if clean_word in vocab:
            vector[vocab[clean_word]] += 1
    return vector

def bow_search(query, docs):
    """
    Rank documents by lexical overlap using the BOW technique.
    The dot product between the query vector and each document vector
    indicates how many words they share.
    Assumes a module-level VOCAB dict mapping each word to its index
    (see the example below for one way to build it).
    """
    query_vec = bow_vectorize(query, VOCAB)
    scores = []
    for i, doc in enumerate(docs):
        doc_vec = bow_vectorize(doc, VOCAB)
        score = np.dot(query_vec, doc_vec)  # Higher score = more overlap
        scores.append((i, score))
    # Sort by descending overlap
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores

Let's break this down:

  1. bow_vectorize: Splits the text into words, applies some light cleanup (punctuation removal), and counts occurrences. If “external” appears once in the query, that contributes 1 to the corresponding position in the query vector.
  2. bow_search: Converts the query into a BOW vector, does the same for each document, and uses the dot product to measure shared token counts. Documents with many overlapping terms move to the top of the list.
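
Both functions rely on a module-level VOCAB dictionary that maps each word to its vector index. The lesson doesn't fix a particular construction, so here is a minimal sketch, using a small illustrative corpus and query of our own, of how you might build it and run a search with the functions above:

Python
# A minimal sketch of building VOCAB and running bow_search.
# The docs and query below are illustrative stand-ins, not the lesson's exact corpus.
docs = [
    "Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.",
    "Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.",
]

# Collect every cleaned token across the corpus and assign it a stable index
tokens = set()
for doc in docs:
    for word in doc.lower().split():
        tokens.add(word.strip(".,!?"))
VOCAB = {word: idx for idx, word in enumerate(sorted(tokens))}

query = "How does a system combine external data with language generation?"
for doc_id, score in bow_search(query, docs):
    print(f"Doc {doc_id} | Score: {score}")

Because every document and query is projected onto the same fixed vocabulary, the dot product in bow_search directly counts shared word occurrences.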

This method is straightforward and fast for situations where exact word usage is critical. But what if your query is phrased differently than the document's text? That's where embeddings shine.

Implementing Embedding-based Search

To tackle the challenge of phrasing differences or synonyms, let’s look at embedding-based search:

Python
import numpy as np
from numpy.linalg import norm

def cos_sim(a, b):
    """
    Compute cosine similarity between two vectors,
    indicating how similar they are.
    """
    return np.dot(a, b) / (norm(a) * norm(b))

def embedding_search(query, docs, model):
    """
    Rank documents by comparing how semantically close they are
    to the query in the embedding space using cosine similarity.
    The model is any sentence-embedding model with an encode method.
    """
    # Encode both the query and documents into embeddings
    query_emb = model.encode([query])[0]
    doc_embs = model.encode(docs)

    scores = []
    for i, emb in enumerate(doc_embs):
        score = cos_sim(query_emb, emb)
        scores.append((i, score))
    # Sort by semantic similarity in descending order
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores

In this code snippet:

  1. cos_sim: Implements a cosine similarity function to measure how closely two vectors align. Remember, if they point in a similar direction in embedding space, the cosine similarity value is higher.
  2. embedding_search: Converts the query and each document into embedding vectors using the model, then uses cosine similarity to rank how “close” the document is to the query’s meaning.
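
To try this out, you need an encoder model. Here is a minimal sketch assuming the sentence-transformers package is installed; the model name all-MiniLM-L6-v2 and the tiny corpus are our own illustrative choices, not something the lesson mandates:

Python
from sentence_transformers import SentenceTransformer  # assumed dependency

# Load a general-purpose sentence encoder (one common choice; any model
# whose encode method returns one vector per input text would work)
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "By merging retrieved text with generative models, RAG overcomes the limitations of static training data.",
    "Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C.",
]
query = "How does a system combine external data with language generation?"

for doc_id, score in embedding_search(query, docs, model):
    print(f"Doc {doc_id} | Score: {score:.4f}")

No vocabulary is needed here: the model maps any text, even words it never saw side by side, into the same embedding space.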

Here, the retrieval process depends more on interpretive meaning than precise word matching. That means a query about “combining external data with generative models” can find documents discussing “merging external text into RAG systems,” even if some words differ.

Analyzing the Search Output

Finally, let's consider the sample query "How does a system combine external data with language generation to improve responses?" and discuss the corresponding search results for both Bag-of-Words (BOW) and embedding-based methods:

Plain text
Query: How does a system combine external data with language generation to improve responses?

BOW Search Results:
  Doc 3 | Score: 5 | Text: Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.
  Doc 0 | Score: 4 | Text: Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.
  Doc 4 | Score: 3 | Text: Financial institutions analyze market data and use automated report generation to guide investment decisions.
  Doc 2 | Score: 2 | Text: By merging retrieved text with generative models, RAG overcomes the limitations of static training data.
  Doc 5 | Score: 2 | Text: Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.
  Doc 1 | Score: 1 | Text: RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.
  Doc 6 | Score: 0 | Text: Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C.

Embedding-based Search Results:
  Doc 0 | Score: 0.5939 | Text: Retrieval-Augmented Generation (RAG) enhances language models by integrating relevant external documents into the generation process.
  Doc 1 | Score: 0.4375 | Text: RAG systems retrieve information from large databases to provide contextual answers beyond what is stored in the model.
  Doc 2 | Score: 0.4234 | Text: By merging retrieved text with generative models, RAG overcomes the limitations of static training data.
  Doc 3 | Score: 0.3179 | Text: Media companies combine external data feeds with digital editing tools to optimize broadcast schedules.
  Doc 4 | Score: 0.2539 | Text: Financial institutions analyze market data and use automated report generation to guide investment decisions.
  Doc 5 | Score: 0.2015 | Text: Healthcare analytics platforms integrate patient records with predictive models to generate personalized care plans.
  Doc 6 | Score: 0.0802 | Text: Bananas are popular fruits that are rich in essential nutrients such as potassium and vitamin C.

When using BOW, notice that the ranking hinges entirely on exact keyword matches. Document 3 lands at the top because it shares five tokens with the query ("combine", "external", "data", "with", and "to"), even though Document 0 is arguably more relevant to the query's intent regarding language models and RAG. The embedding-based approach, by contrast, ranks Document 0 first because it captures the semantic relationship between "language generation" and "integrating relevant external documents," even though the exact words differ.

Embeddings also give the system the flexibility to correctly elevate items like Document 0 and Document 1, which match the query's goal, above Document 3. This illustrates how embeddings bridge the gap when the query and the documents use different but related terms.
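
As noted at the start of the lesson, practical systems often chain the two methods: a cheap lexical pass shortlists candidates, and the embedding model re-ranks only that shortlist. Here is a minimal sketch built from the bow_search and embedding_search functions above; the top_k cutoff is an illustrative parameter of our own:

Python
def hybrid_search(query, docs, model, top_k=3):
    """
    A minimal two-stage sketch: shortlist candidates by lexical
    overlap, then re-rank only that shortlist by semantic similarity.
    Relies on the global VOCAB used by bow_search being built from docs.
    """
    # Stage 1: cheap BOW pass keeps the top_k lexically closest documents
    candidate_ids = [i for i, score in bow_search(query, docs)[:top_k]]

    # Stage 2: embed only the shortlisted documents and re-rank them
    shortlist = [docs[i] for i in candidate_ids]
    reranked = embedding_search(query, shortlist, model)

    # Map shortlist positions back to the original document ids
    return [(candidate_ids[i], score) for i, score in reranked]

This keeps the expensive encoding step limited to a handful of documents while still letting semantic similarity decide the final order.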

Conclusion And Next Steps

In this lesson, we compared a Bag-of-Words-based search with embedding-based semantic search and saw how each method ranks documents differently. BOW is fast and transparent for exact, vocabulary-based matches, while embeddings capture deeper connections between words and phrases.

Next, you’ll get hands-on practice implementing these approaches. Have fun!
