Introduction

We are now in the fourth and final lesson of this course, Beyond Basic RAG: Improving Our Pipeline! Up to this point, we have explored ways to enhance Retrieval-Augmented Generation (RAG) systems by refining chunking strategies and leveraging advanced retrieval methods. In this lesson, you will learn how to merge a lexical-based retrieval approach (using Okapi BM25) with your existing embedding-based retrieval mechanism, creating a powerful hybrid retrieval pipeline.

By the end of this lesson, you should be able to:

  1. Grasp the intuition behind Okapi BM25 for lexical retrieval.
  2. Construct a BM25 index on your corpus.
  3. Combine BM25 scores with embedding-based retrieval scores using a configurable weight parameter, alpha.

Understanding the Okapi BM25 Algorithm

Within the category of lexical-based search methods, Okapi BM25 is a popular choice. It focuses on the presence of specific keywords, rewarding relevant chunks that contain more occurrences of the query terms. At the same time, it avoids overemphasizing repeated words by incorporating a saturation effect.

A few core ideas behind BM25:

  • Term Frequency (TF): More occurrences of a query term in a chunk signal higher relevance, with diminishing returns from repetition (the saturation effect mentioned above).
  • Inverse Document Frequency (IDF): Terms that appear in many chunks carry less weight, so rare, discriminative terms contribute more to the score.
  • Document Length Normalization: BM25 accounts for chunk length, ensuring that very long chunks with many repeated words are not unfairly favored.

Although the underlying formula has several parameters and normalizations, the general purpose is straightforward: favor chunks containing the search terms, but don't let them dominate purely by repeating keywords.
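For reference, the standard BM25 scoring formula for a query Q = (q_1, ..., q_n) against a chunk D is:

LaTeX
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}

Here f(q_i, D) is how often term q_i occurs in D, |D| is the chunk length in tokens, avgdl is the average chunk length across the corpus, and IDF(q_i) down-weights terms that appear in many chunks. The parameter k_1 controls term-frequency saturation and b controls length normalization; the BM25Okapi implementation we use below defaults to k_1 = 1.5 and b = 0.75.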

Building a BM25 Index

Here is a simple function, built on the rank_bm25 package, that creates a BM25 index from your chunked corpus. We assume you already have a collection of text chunks ready, each represented as a dictionary with a "text" key.

Python
from rank_bm25 import BM25Okapi

def build_bm25_index(chunks):
    """
    Build a BM25Okapi index from the chunk texts for lexical-based retrieval.
    BM25 scores are unbounded; their magnitude depends on the corpus.
    """
    # Convert each chunk's text into a list of lowercased tokens
    corpus = [c["text"].lower().split() for c in chunks]
    return BM25Okapi(corpus)

In this snippet:

  • We split chunks into tokens (words) by lowercasing and splitting their text.
  • We use BM25Okapi to create our lexical index.
  • Later, we'll score new queries against this index to get relevance, as in the quick example below.
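As a quick sanity check, here is how you might score a sample query against the index (a sketch; the query text is illustrative, and chunks is assumed to be the same list of {"text": ...} dictionaries used above):

Python
# Build the index, then score a tokenized query against it
bm25_index = build_bm25_index(chunks)
scores = bm25_index.get_scores("vacation policy".lower().split())
print(scores)  # one BM25 score per chunk; higher means more lexically relevant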

Merging BM25 and Embedding-Based Retrieval: BM25 Scoring & Similarity Calculation

Below is the first segment of a hybrid_retrieval function that computes BM25 scores for each chunk and converts distances from an embedding-based search to similarities:

Python
def hybrid_retrieval(query, chunks, bm25, collection, top_k=3, alpha=0.5):
    """
    Merge BM25 and embedding-based results:
      1) Compute BM25 scores for each chunk.
      2) Get embedding distances and convert them to similarities.
      3) Min-max normalize the BM25 scores to [0, 1] (the similarities
         from step 2 already fall in (0, 1]).
      4) Combine scores as final_score = alpha * bm25_norm + (1 - alpha) * embed_sim.
      5) Sort by final score, descending, and return the top_k results.
    """
    # Tokenize the query the same way the corpus was tokenized
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)

    # Query the embedding-based store for a generous pool of candidate chunks
    embed_results = collection.query(query_texts=[query], n_results=min(top_k * 5, len(chunks)))

    # Convert distances to similarities, keyed by integer chunk index
    embed_scores_dict = {}
    for i in range(len(embed_results['documents'][0])):
        # Chroma returns string ids; we assume they were stored as stringified
        # chunk indices, so convert back to int to match the chunks list
        idx = int(embed_results['ids'][0][i])
        distance = embed_results['distances'][0][i]
        similarity = 1 / (1 + distance)
        embed_scores_dict[idx] = similarity

In this code:

  • The function signature includes the parameters we need: the user query, the chunk corpus, the BM25 index, the embedding-based collection, how many top results to return (top_k), and the alpha weight.
  • We compute BM25 scores by tokenizing the query (lowercasing and splitting), then calling bm25.get_scores().
  • Next, we query the embedding-based store to retrieve potential matches.
  • We loop through these matches to transform distances into similarities (using a simple 1 / (1 + distance) formula, illustrated below) and store them in a dictionary keyed by integer chunk index.
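To make the conversion concrete, here is the 1 / (1 + distance) mapping applied to a few sample distances (illustrative values only):

Python
# A distance of 0 (identical embeddings) maps to similarity 1;
# larger distances shrink smoothly toward 0
for distance in (0.0, 0.5, 2.0):
    print(f"distance={distance:.1f} -> similarity={1 / (1 + distance):.3f}")
# distance=0.0 -> similarity=1.000
# distance=0.5 -> similarity=0.667
# distance=2.0 -> similarity=0.333

This keeps every similarity in the (0, 1] range, which makes it easy to blend with the normalized BM25 scores in the next step.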

Merging BM25 and Embedding-Based Retrieval: Normalizing Scores & Final Ranking

Below is the second segment of the same function. It normalizes the BM25 scores, merges them with the similarity values, sorts by the final combined score, and returns the top results:

Python
    # Normalize BM25 scores and merge them with embedding similarities
    merged = []
    if len(bm25_scores) > 0:
        bm25_min, bm25_max = min(bm25_scores), max(bm25_scores)
    else:
        bm25_min, bm25_max = 0, 1
    for i, chunk in enumerate(chunks):
        bm25_raw = bm25_scores[i]
        if bm25_max != bm25_min:
            bm25_norm = (bm25_raw - bm25_min) / (bm25_max - bm25_min)
        else:
            bm25_norm = 0.0

        # Chunks outside the embedding candidate pool default to 0.0
        embed_sim = embed_scores_dict.get(i, 0.0)
        final_score = alpha * bm25_norm + (1 - alpha) * embed_sim
        merged.append((i, final_score))

    # Sort by combined score, highest first
    merged.sort(key=lambda x: x[1], reverse=True)
    top_results = merged[:top_k]
    return [(idx, chunks[idx], score) for (idx, score) in top_results]

Here's what's happening in this snippet:

  • First, we find the minimum and maximum BM25 scores so we can normalize values into the [0, 1] range (falling back to 0.0 when all scores are identical).
  • For each chunk, we retrieve its BM25 score and its corresponding embedding similarity, defaulting to 0.0 for chunks that were not among the embedding candidates.
  • We compute a final score as the weighted sum of the normalized BM25 score and the embedding similarity, controlled by alpha.
  • Finally, we sort all chunks by the final combined score (descending) and return the top_k results; a small worked example of the blend follows.
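As a worked example of the blend, suppose a chunk has a raw BM25 score of 7.2, the corpus-wide BM25 minimum and maximum are 1.2 and 9.2, the chunk's embedding similarity is 0.5, and alpha = 0.6 (all numbers are illustrative):

Python
bm25_norm = (7.2 - 1.2) / (9.2 - 1.2)            # 6.0 / 8.0 = 0.75
final_score = 0.6 * bm25_norm + (1 - 0.6) * 0.5  # 0.45 + 0.20
print(f"{final_score:.2f}")                      # 0.65

With alpha = 0.6, the lexical signal contributes slightly more than the semantic one, so a strong keyword match can outrank a chunk that is only semantically close.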

Putting It All Together

Below is a brief illustration of how you could integrate the above functions into your pipeline:

Python
# Build corpus chunks, BM25 index, and embedding-based store
chunked_docs = load_and_chunk_corpus(..., 40)
bm25_index = build_bm25_index(chunked_docs)
collection = build_chroma_collection(chunked_docs)

# Perform hybrid retrieval
query = "What do our internal company policies state?"
results = hybrid_retrieval(query, chunked_docs, bm25_index, collection, top_k=3, alpha=0.6)

# Inspect the results
if not results:
    print("No chunks found. You may want to provide a generic response.")
else:
    for chunk_idx, chunk_data, final_score in results:
        print(f"Chunk {chunk_idx} | Score: {final_score:.4f}")
        print("Text:", chunk_data['text'])
        print("-----")

Here, we:

  • Load or chunk our corpus into manageable pieces.
  • Build a BM25 index for lexical retrieval.
  • Create an embedding-based collection for semantic retrieval (see the sketch after this list).
  • Blend both retrieval methods in one pipeline.
  • Inspect the top results by their combined scores.
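The load_and_chunk_corpus and build_chroma_collection helpers are assumed to come from your earlier pipeline work. If you need a stand-in for the latter, here is a minimal sketch using ChromaDB's default embedding function; it stores each chunk under its stringified index, which is why hybrid_retrieval converts the returned ids back to integers:

Python
import chromadb

def build_chroma_collection(chunks, name="hybrid_rag_demo"):
    """
    Minimal sketch: create an in-memory Chroma collection and add every
    chunk, letting Chroma's default embedding function embed the text.
    The collection name is illustrative.
    """
    client = chromadb.Client()
    collection = client.create_collection(name=name)
    collection.add(
        ids=[str(i) for i in range(len(chunks))],  # stringified chunk indices
        documents=[c["text"] for c in chunks],
    )
    return collection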

Choosing the Alpha Parameter

The alpha parameter determines the balance between lexical and semantic retrieval. Here are some guidelines for choosing its value:

  • Higher Alpha (e.g., 0.7 or 0.8): Prioritize exact keyword matches. This is useful when precise terminology is crucial, such as in legal or technical documents.
  • Lower Alpha (e.g., 0.3 or 0.2): Emphasize semantic understanding. This is beneficial when the context or meaning is more important than exact wording, such as in creative writing or conversational queries.
  • Balanced Alpha (e.g., 0.5): Use when both lexical precision and semantic context are equally important, providing a middle ground.

Experiment with different alpha values to see how they affect retrieval quality in your specific use case.
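One simple way to experiment is to sweep alpha across a few values on the same query and compare which chunks surface (a sketch, reusing the objects from the pipeline above):

Python
query = "What do our internal company policies state?"
for alpha in (0.2, 0.5, 0.8):
    results = hybrid_retrieval(query, chunked_docs, bm25_index, collection, top_k=3, alpha=alpha)
    top_ids = [idx for idx, _, _ in results]
    print(f"alpha={alpha}: top chunks = {top_ids}")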

Conclusion and Next Steps

In this lesson, you explored how to enhance retrieval accuracy by combining Okapi BM25 with embedding-based methods. This approach helps ensure that you do not miss relevant chunks due to subtle differences in word usage or synonyms. You can tune the relative weight (alpha) between exact matching and semantic matching to adapt to different use cases.

In upcoming practice sessions, you will have the opportunity to test various parameters, experiment with different queries, and observe the impact on retrieval quality. Feel free to adjust your chunk sizes, scoring thresholds, or alpha value as you refine your hybrid pipeline. This balanced approach will help you build more robust RAG systems for real-world applications.
