Welcome back to this second lesson in the "Beyond Basic RAG: Improving our Pipeline" course! In the previous lesson, we explored ways to ensure that your language model stays grounded by responding only with information taken from retrieved context. That approach helps us avoid hallucinations and keeps the output reliable.
In this lesson, we'll improve the pipeline even further by making the retrieval process more iterative. Instead of collecting chunks of text just once before moving on to generation, we'll refine our queries step by step. This multi-stage retrieval can pinpoint the most relevant information and produce a more helpful final context.
Imagine a scenario where a user asks: "Tell me about the regulations for staff members." The question might be too broad. A typical retrieval step might find chunks containing some relevant information, but you might also want to narrow in on "internal policies" or "mandatory forms" for more precision.
Iterative retrieval does exactly that:
- Retrieve an initial chunk based on the user's query.
- Refine that query with a new keyword from the retrieved chunk (e.g., "internal" or "policies").
- Repeat until you've gathered a set of chunks that thoroughly answers the question, or until improvements level off.
This multi-pass approach can drastically improve the depth and breadth of the retrieved information, making your final context more complete. Below, we'll walk through the building blocks of an iterative retrieval system in detail.
Imagine a user asks, "Tell me about the regulations for staff members." Our vector database may include chunks like:
- Chunk 1: "Our company requires that all staff members adhere to internal policies such as punctuality, dress code, and ethical behavior..."
- Chunk 2: "Regulations for staff emphasize adherence to both internal policies and government standards, covering conduct, reporting, ..."
Iteration 1:
- Query: "Tell me about the regulations for staff members"
- Best match: Chunk 1 (score: 0.87)
- Extracted keyword: "internal"
Iteration 2:
- Updated Query: "Tell me about the regulations for staff members internal"
- Best match: Chunk 2 (score: 0.93)
Since further refinement doesn't significantly improve the score, the process stops. The system then uses these accumulated chunks to generate a grounded and comprehensive answer.
The first step is to define a function that fetches the best matching chunk given a query. Take a look at the example below, with additional comments to make it clear:
```python
def retrieve_best_chunk(query_text, collection, n_results=1):
    """
    Retrieve the best matching chunk from the collection based on the given query.

    Returns:
        best_chunk_text, best_chunk_score, best_chunk_metadata
        (or None, None, None if retrieval fails)
    """
    # Perform a similarity search for the provided query
    retrieval = collection.query(query_texts=[query_text], n_results=n_results)

    # If nothing is found, return None
    if not retrieval['documents'][0]:
        return None, None, None

    # Extract the best match from the results
    best_chunk_text = retrieval['documents'][0][0]
    best_distance = retrieval['distances'][0][0]

    # Convert 'distance' to a simple similarity score
    best_chunk_score = 1 / (1 + best_distance)
    best_chunk_metadata = retrieval['metadatas'][0][0]
    return best_chunk_text, best_chunk_score, best_chunk_metadata
```
What's happening here?
- We query our collection (which contains chunks of text stored in a vector database) for the user's `query_text`.
- If no chunks are returned, we gracefully exit with `None`.
- Otherwise, we pick the top chunk (the first one) and compute a simple similarity score as `1 / (1 + best_distance)`. This formula transforms the distance metric into a similarity score: a smaller distance yields a higher score, and the added 1 prevents division by zero.
- We return the chunk text, its score, and its metadata for further use.
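To make the distance-to-similarity conversion concrete, here is a tiny standalone sketch; the distance values are made up for illustration:

```python
# Hypothetical distances, as a vector store might return them
distances = [0.15, 0.87, 2.40]

# Convert each distance into a similarity score in (0, 1]:
# a smaller distance maps to a score closer to 1
scores = [1 / (1 + d) for d in distances]

for d, s in zip(distances, scores):
    print(f"distance={d:.2f} -> similarity={s:.3f}")
```

Note how the ordering is preserved but inverted: the closest chunk (smallest distance) gets the highest similarity.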
Once you've retrieved a chunk, you often want to refine your query before the next retrieval pass. For instance, maybe the returned chunk contains a useful new keyword, like "internal" or "procedures", that can make the next query more specific. Below is a snippet showing how you can extract a refinement keyword and append it to your query:
```python
import re

# A set of common words to ignore; in practice, use a fuller list
# (e.g., from NLTK or scikit-learn)
STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "that", "for"}

def extract_refinement_keyword(chunk_text, current_query):
    """
    Extract a single keyword from the chunk that is not already in the current query.
    - Ignores stopwords and short words.
    - Picks the longest remaining candidate.
    """
    # Convert text to lowercase words
    chunk_words = re.findall(r'\b\w+\b', chunk_text.lower())
    query_words = set(re.findall(r'\b\w+\b', current_query.lower()))

    # Filter out stopwords, words that already appear in the query, and very short words
    candidate_words = [
        w for w in chunk_words
        if w not in STOPWORDS and w not in query_words and len(w) > 4
    ]

    # If no candidates remain, return an empty string
    if not candidate_words:
        return ""

    # Pick the longest candidate word among those left
    refine_word = max(candidate_words, key=len)
    return refine_word

def refine_query(current_query, refine_word):
    """
    Append the chosen refine_word to the current query if it exists.
    """
    # If there's no refinement word, do nothing
    if not refine_word:
        return current_query
    # Otherwise, enrich the query with the new keyword
    return f"{current_query} {refine_word}"
```
Why is this useful?
- By filtering out stopwords (common words like “the,” “and,” “is” that add little meaning), you reduce noise.
- By skipping words already in the current query, you make sure you're not just repeating the same request.
- By capturing only longer words, you often get more meaningful terms (“policy” vs. “at”).
- Finally, `refine_query` simply appends the chosen keyword, shaping a more specific query for the next retrieval cycle.
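To see the heuristic in action, here is a small standalone run using a toy stopword list; both the list and the sample chunk are illustrative, so the picked word can differ from a production setup:

```python
import re

# A deliberately tiny stopword list, for this demo only
STOPWORDS = {"about", "the", "for", "our", "that", "such", "and", "all"}

chunk = ("Our company requires that all staff members adhere to internal "
         "policies such as punctuality, dress code, and ethical behavior")
query = "Tell me about the regulations for staff members"

chunk_words = re.findall(r'\b\w+\b', chunk.lower())
query_words = set(re.findall(r'\b\w+\b', query.lower()))

# Same filtering as extract_refinement_keyword: no stopwords,
# no words already in the query, no short words
candidates = [w for w in chunk_words
              if w not in STOPWORDS and w not in query_words and len(w) > 4]
refine_word = max(candidates, key=len)

print(refine_word)               # longest surviving candidate
print(f"{query} {refine_word}")  # the refined query
```

With this toy setup the longest candidate happens to be "punctuality" rather than the "internal" from the walkthrough, which shows how sensitive the longest-word heuristic is to the stopword list and the chunk's wording.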
Below is a condensed example showing how we can loop through multiple retrieval steps, each time refining the query and checking whether we're truly improving:
```python
# Minimum score gain required to keep iterating
IMPROVEMENT_THRESHOLD = 0.02

def iterative_retrieval(query, collection, steps=3):
    """
    Multi-step retrieval with a simple query refinement approach:
    1) Retrieve the best chunk for the current query.
    2) Extract one new keyword from that chunk and add it to the query.
    3) Repeat until no improvements or keywords are found,
       or we've reached 'steps' iterations.
    """
    accumulated_chunks = []
    current_query = query
    best_score_so_far = 0.0

    for step in range(steps):
        # Get the best chunk for the current query
        best_chunk_text, best_chunk_score, metadata = retrieve_best_chunk(current_query, collection)

        # End if nothing relevant is found or if the new chunk's score hasn't improved enough
        if not best_chunk_text or best_chunk_score <= best_score_so_far + IMPROVEMENT_THRESHOLD:
            break

        # Update the best score tracker
        best_score_so_far = best_chunk_score

        # Record the current step result
        accumulated_chunks.append({
            'step': step + 1,
            'query': current_query,
            'retrieved_chunk': best_chunk_text,
            'score': best_chunk_score
        })

        # Pick a new keyword from the chunk
        refine_word = extract_refinement_keyword(best_chunk_text, current_query)

        # Refine the query based on this new keyword
        current_query = refine_query(current_query, refine_word)

    return accumulated_chunks
```
Notes on what's happening:
- We limit to a certain number of steps (e.g., 3 iterations).
- In each iteration, we retrieve the best matching chunk.
- We compare its score to the best score we've seen so far. If there's no improvement beyond a small threshold (like 0.02), we assume that refining further might not be worthwhile.
- If we continue, we extract a fresh keyword from the chunk and add it to our query.
- By the end, the list `accumulated_chunks` holds the chunks from each iteration, often providing a richer context than a single retrieval could.
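Once the loop finishes, the accumulated chunks are typically joined into a single context string for the generation step. Here is a minimal sketch using made-up results shaped like the dictionaries the loop appends:

```python
# Hypothetical output of iterative_retrieval, matching the walkthrough above
accumulated_chunks = [
    {'step': 1,
     'query': 'Tell me about the regulations for staff members',
     'retrieved_chunk': 'Our company requires that all staff members adhere '
                        'to internal policies such as punctuality...',
     'score': 0.87},
    {'step': 2,
     'query': 'Tell me about the regulations for staff members internal',
     'retrieved_chunk': 'Regulations for staff emphasize adherence to both '
                        'internal policies and government standards...',
     'score': 0.93},
]

# Join the retrieved chunks into one context block for the language model
context = "\n\n".join(c['retrieved_chunk'] for c in accumulated_chunks)
print(context)
```

The blank-line separator keeps the chunks visually distinct in the prompt; you could also prefix each chunk with its step number or score if the model benefits from that metadata.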
While the iterative retrieval process we've outlined is powerful, it's important to recognize some practical considerations and limitations:
- Heuristics in Use: The choice of the longest word as a refinement keyword and the threshold for improvement are simple heuristics. These can be adjusted and improved based on the specific needs of your application.
- Complexity vs. Performance: More iterations can lead to better context but also increase computational cost. Balancing these is crucial for real-time applications.
- Simplifications: The current approach assumes that a single keyword can significantly refine a query, which might not always be the case. More sophisticated natural language processing techniques could be employed for better refinement.
- Limitations: The current method relies on the quality of the initial retrieval. If the first chunk is not relevant, subsequent iterations may not improve the context significantly.
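As one example of a more sophisticated refinement, the longest-word heuristic could be swapped for a rarity-based pick: choose the candidate that appears in the fewest corpus documents, an IDF-style score. The sketch below is an illustrative alternative with made-up data, not part of the pipeline above:

```python
import math
import re
from collections import Counter

def idf_refinement_keyword(chunk_text, current_query, corpus):
    """Pick the candidate word that is rarest across the corpus,
    on the assumption that rare words are more discriminative."""
    query_words = set(re.findall(r'\b\w+\b', current_query.lower()))
    candidates = [w for w in re.findall(r'\b\w+\b', chunk_text.lower())
                  if w not in query_words and len(w) > 4]
    if not candidates:
        return ""
    # Count how many corpus documents contain each word
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(re.findall(r'\b\w+\b', doc.lower())))
    # Higher IDF means rarer, hence likely more specific
    return max(candidates,
               key=lambda w: math.log(len(corpus) / (1 + doc_freq[w])))

# Toy corpus: "internal" appears everywhere, "punctuality" nowhere
corpus = [
    "internal policies for staff",
    "internal standards and reporting",
    "internal conduct rules",
]
word = idf_refinement_keyword(
    "internal policies and punctuality standards",
    "staff regulations policies",
    corpus,
)
print(word)
```

Here the ubiquitous word "internal" is penalized while the corpus-rare "punctuality" wins, which is the opposite of what a raw length or frequency heuristic would favor.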
You have now seen how iterative retrieval can make your RAG pipeline more robust by gradually honing in on the most relevant information. This lesson builds on our previous work about keeping generations grounded—and now you have an even better way to gather the right context in the first place.
Coming up next, you will get hands-on practice implementing and tweaking iterative retrieval strategies. Don't hesitate to experiment with different thresholds, numbers of steps, or query-refinement approaches. Each small tweak can make a big difference in the final performance of your RAG system.
Stay curious, and keep refining! You're making significant strides toward building a complete, high-performing retrieval-augmented generation pipeline.
