We are now in the fourth and final lesson of this course on Beyond Basic RAG: Improving Our Pipeline! Up to this point, we have explored ways to enhance Retrieval-Augmented Generation (RAG) systems by refining chunking strategies and leveraging advanced retrieval methods. In this lesson, you will learn how to merge a lexical-based retrieval approach (using Okapi BM25) with your existing embedding-based retrieval mechanism, creating a powerful hybrid retrieval pipeline.
By the end of this lesson, you should be able to:
- Grasp the intuition behind Okapi BM25 for lexical retrieval.
- Construct a BM25 index on your corpus.
- Combine BM25 scores with embedding-based retrieval scores using a configurable weight parameter, alpha.
Within the category of lexical-based search methods, Okapi BM25 is a popular choice. It focuses on the presence of specific keywords, rewarding relevant chunks that contain more occurrences of the query terms. At the same time, it avoids overemphasizing repeated words by incorporating a saturation effect.
A few core ideas behind BM25:
- Term Frequency (TF): More keyword matches in a chunk can signal higher relevance.
- Document Length Normalization: BM25 accounts for chunk length, ensuring that very long chunks with many repeated words are not unfairly scored.
Although the underlying formula has several parameters and normalizations, the general purpose is straightforward: favor chunks containing the search terms, but don't let them dominate purely by repeating keywords.
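For reference, this is the standard Okapi BM25 scoring function (you won't need to compute it by hand, since the library handles it):

$$
\text{score}(D, Q) = \sum_{q \in Q} \text{IDF}(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

Here $f(q, D)$ is the frequency of term $q$ in chunk $D$, $|D|$ is the chunk length, $\text{avgdl}$ is the average chunk length across the corpus, and $k_1$ and $b$ control term-frequency saturation and length normalization, respectively.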
Here is a simple function that builds a BM25 index from your chunked corpus. We assume you already have a collection of text chunks ready.
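A minimal sketch, assuming the rank_bm25 package and chunks stored as dictionaries with a "text" field:

```python
from rank_bm25 import BM25Okapi

def build_bm25_index(chunks):
    # Tokenize each chunk by lowercasing and splitting on whitespace.
    tokenized_corpus = [chunk["text"].lower().split() for chunk in chunks]
    # BM25Okapi builds the lexical index over the tokenized corpus.
    return BM25Okapi(tokenized_corpus)
```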
In this snippet:
- We split chunks into tokens (words) by lowercasing and splitting their text.
- We use BM25Okapi to create our lexical index.
- Later, we'll score new queries on this index to get relevance.
Below is the first segment of a hybrid_retrieval function that computes BM25 scores for each chunk and converts distances from an embedding-based search to similarities:
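A minimal sketch of this first segment, assuming a Chroma-style collection whose ids are the chunk indices stored as strings; adapt the query call to whatever vector store you use:

```python
def hybrid_retrieval(query, chunks, bm25, collection, top_k=5, alpha=0.5):
    # Lexical scores: tokenize the query the same way the corpus was tokenized.
    query_tokens = query.lower().split()
    bm25_scores = bm25.get_scores(query_tokens)

    # Semantic scores: query the embedding store for every chunk.
    results = collection.query(query_texts=[query], n_results=len(chunks))

    # Convert distances to similarities (higher is better), keyed by chunk index.
    embedding_sims = {}
    for chunk_id, distance in zip(results["ids"][0], results["distances"][0]):
        embedding_sims[int(chunk_id)] = 1 / (1 + distance)
```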
In this code:
- The function signature includes the parameters we need: the user query, the chunk corpus, the BM25 index, the embedding-based collection, how many top results to return (top_k), and the alpha weight.
- We compute BM25 scores by tokenizing the query (lowercasing and splitting), then calling bm25.get_scores().
- Next, we query the embedding-based store to retrieve potential matches.
- We loop through these matches to transform distances into similarities (using a simple 1 / (1 + distance) formula) and store them in a dictionary keyed by chunk index.
Below is the second segment of the same function. It normalizes the BM25 scores, merges them with the similarity values, sorts by the final combined score, and returns the top results:
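Continuing the same function (this min-max normalization is one reasonable choice, not the only one):

```python
    # Normalize BM25 scores into [0, 1] so they are comparable to similarities.
    min_bm25, max_bm25 = min(bm25_scores), max(bm25_scores)
    score_range = (max_bm25 - min_bm25) or 1.0  # guard against division by zero

    # Blend: alpha weights the lexical score, (1 - alpha) the semantic score.
    combined = []
    for idx in range(len(chunks)):
        norm_bm25 = (bm25_scores[idx] - min_bm25) / score_range
        emb_sim = embedding_sims.get(idx, 0.0)
        combined.append((idx, alpha * norm_bm25 + (1 - alpha) * emb_sim))

    # Sort by the final combined score (descending) and keep the top_k chunks.
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return [(chunks[idx], score) for idx, score in combined[:top_k]]
```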
Here's what's happening in this snippet:
- First, we find the minimum and maximum BM25 scores so we can normalize values into the [0, 1] range.
- For each chunk, we retrieve its BM25 score and its corresponding embedding similarity.
- We compute a final score as the weighted sum of the normalized BM25 score and the embedding similarity, controlled by alpha.
- Finally, we sort all chunks by the final combined score (descending) and return the top_k results.
Below is a brief illustration of how you could integrate the above functions into your pipeline:
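A minimal end-to-end sketch, assuming chromadb with its default embedding function; the sample texts and collection name are placeholders:

```python
import chromadb

# 1. Load or chunk the corpus into manageable pieces.
chunks = [
    {"text": "BM25 rewards chunks that contain the exact query keywords."},
    {"text": "Embeddings capture semantic similarity between texts."},
    {"text": "Hybrid retrieval blends lexical and semantic signals."},
]

# 2. Build a BM25 index for lexical retrieval.
bm25 = build_bm25_index(chunks)

# 3. Create an embedding-based collection for semantic retrieval.
client = chromadb.Client()
collection = client.create_collection(name="hybrid_demo")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=[chunk["text"] for chunk in chunks],
)

# 4. Blend both retrieval methods in one pipeline.
results = hybrid_retrieval(
    "how does lexical keyword retrieval work?",
    chunks, bm25, collection, top_k=2, alpha=0.5,
)

# 5. Inspect the top results by their combined scores.
for chunk, score in results:
    print(f"{score:.3f}  {chunk['text']}")
```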
Here, we:
- Load or chunk our corpus into manageable pieces.
- Build a BM25 index for lexical retrieval.
- Create an embedding-based collection for semantic retrieval.
- Blend both retrieval methods in one pipeline.
- Inspect the top results by their combined scores.
The alpha parameter determines the balance between lexical and semantic retrieval. Here are some guidelines for choosing its value:
- Higher Alpha (e.g., 0.7 or 0.8): Prioritize exact keyword matches. This is useful when precise terminology is crucial, such as in legal or technical documents.
- Lower Alpha (e.g., 0.3 or 0.2): Emphasize semantic understanding. This is beneficial when the context or meaning is more important than exact wording, such as in creative writing or conversational queries.
- Balanced Alpha (e.g., 0.5): Use when both lexical precision and semantic context are equally important, providing a middle ground.
Experiment with different alpha values to see how they affect retrieval quality in your specific use case.
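Since alpha is just a function argument, a quick comparison loop makes this experiment concrete (the query and alpha values here are illustrative):

```python
# Compare how the ranking shifts as alpha moves from semantic to lexical.
for alpha in (0.2, 0.5, 0.8):
    results = hybrid_retrieval(
        "how does lexical keyword retrieval work?",
        chunks, bm25, collection, top_k=3, alpha=alpha,
    )
    print(f"alpha = {alpha}:")
    for chunk, score in results:
        print(f"  {score:.3f}  {chunk['text']}")
```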
In this lesson, you explored how to enhance retrieval accuracy by combining Okapi BM25 with embedding-based methods. This approach helps ensure that you do not miss relevant chunks due to subtle differences in word usage or synonyms. You can tune the relative weight (alpha) between exact matching and semantic matching to adapt to different use cases.
In upcoming practice sessions, you will have the opportunity to test various parameters, experiment with different queries, and observe the impact on retrieval quality. Feel free to adjust your chunk sizes, scoring thresholds, or alpha value as you refine your hybrid pipeline. This balanced approach will help you build more robust RAG systems for real-world applications.
