Welcome to the third lesson in our journey through "Scaling Up RAG with Vector Databases"! Well done, you're halfway through this course. In the previous lesson, you learned how to split or chunk your text data and store those chunks in a vector database collection. Now, we'll delve into retrieving the most relevant chunks for any given query and building an LLM prompt to produce more accurate, context-driven answers.
Before your LLM can generate a coherent, context-rich answer, you need to fetch the right information. Your vector database (for instance, using Chroma) will rank which document chunks are most relevant for a given query.
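The breakdown below refers to a small retrieval helper. A minimal sketch of such a function, assuming a Chroma collection queried through the standard chromadb client API, might look like this:

```python
def retrieve_top_chunks(query, collection, top_k=2):
    """Return the top_k chunks in the collection that are most similar to the query."""
    # Chroma expects query_texts to be a list, even for a single query.
    results = collection.query(query_texts=[query], n_results=top_k)

    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],      # the text content of the chunk
            "doc_id": results['ids'][0][i],           # the document identifier
            "distance": results['distances'][0][i],   # lower distance = closer semantic match
        })
    return retrieved_chunks
```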
Let's break down this code in detail:
- Function Definition: retrieve_top_chunks takes three parameters:
  - query: The user's question or search term.
  - collection: The Chroma collection object containing our embedded documents.
  - top_k: The number of most relevant chunks to retrieve (default is 2).
- Vector Search: The collection.query() function performs a vector-based similarity search to pinpoint which chunks are most aligned with the query.
  - query_texts=[query] passes the user's query as a list (Chroma's API expects a list).
  - n_results=top_k specifies how many matching chunks to return.
- Results Structure: The query returns a dictionary with multiple keys:
  - 'documents': Contains the actual text chunks.
  - 'ids': Contains the document identifiers.
  - 'distances': Each result includes a distance, which indicates how semantically close a chunk is to your query; the lower the distance, the better the match.
  - Each of these keys maps to a nested list structure: [[item1, item2, ...]].
- Processing Results: For each result, the function creates a dictionary with three key pieces of information:
  - "chunk": The actual text content from results['documents'][0][i].
  - "doc_id": The document identifier from results['ids'][0][i].
  - "distance": The similarity score from results['distances'][0][i].
  - These dictionaries are appended to the retrieved_chunks list, which is then returned.
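As a quick sanity check, you can call the helper and inspect what comes back; the query string and printed fields here are purely illustrative:

```python
chunks = retrieve_top_chunks("What are recent breakthroughs in renewable energy?", collection)

# Each element is a dictionary shaped like:
# {"chunk": "<matching text>", "doc_id": "<document id>", "distance": <float>}
for item in chunks:
    print(item["doc_id"], round(item["distance"], 3), item["chunk"][:80])
```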
Once you have your relevant chunks, the next step is constructing a prompt that ensures the LLM focuses on precisely those chunks. This helps maintain factual accuracy and a tight context.
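A minimal version of such a prompt builder, using the build_prompt name that appears later in this lesson and instruction wording you are free to adapt, could look like:

```python
def build_prompt(query, retrieved_chunks):
    """Assemble a prompt that keeps the LLM focused on the retrieved context."""
    # Number each chunk so the context pieces are easy to scan and reference.
    context = "\n".join(
        f"{i + 1}. {item['chunk']}" for i, item in enumerate(retrieved_chunks)
    )
    return (
        f"Question: {query}\n\n"
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
```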
Why is this important?
- Controlled Context: By explicitly instructing the LLM to focus on the given context, you reduce the probability of hallucinations.
- Flexibility: You can modify the prompt format — like adding bullet points or rewording instructions — to direct the LLM's style or depth of response.
- Clarity: Including the question upfront reminds the model of the exact query it must address.
We'll be seeing an actual prompt example later in the lesson!
To see this in action, you'll first need to load your corpus data and create a collection in your vector database. This ensures your text chunks are accessible for the retrieval process.
Below is an example of how you might load documents from a JSON file, initialize an embedding model, and create (or retrieve) a collection in your chosen vector database:
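One possible sketch of that setup uses Chroma with a Sentence Transformers embedding function; the corpus.json filename, its record fields ("id" and "text"), and the collection name are assumptions for illustration, so adapt them to your own data:

```python
import json

import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Load the chunked corpus produced in the previous lesson.
# Assumed structure: a list of {"id": ..., "text": ...} records.
with open("corpus.json", "r") as f:
    corpus = json.load(f)

# Initialize an embedding function; any Sentence Transformers model can be used here.
embedding_fn = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Create the collection if it doesn't exist yet, or reuse it if it does.
client = chromadb.Client()
collection = client.get_or_create_collection(
    name="documents",                 # assumed collection name
    embedding_function=embedding_fn,
)

# Bulk ingestion: add all chunks in a single call rather than one at a time.
collection.add(
    documents=[item["text"] for item in corpus],
    ids=[item["id"] for item in corpus],
)
```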
Key Details
- Embedding Function: Here, SentenceTransformerEmbeddingFunction is used for generating vector representations of your text. You can replace it with another embedding model suited to your needs.
- Collection: Instead of manually creating a new collection each time, get_or_create_collection either retrieves an existing one or initializes a fresh collection for you.
- Bulk Ingestion: By batching documents, you efficiently add multiple items to your vector database at once.
With your collection in place, it's time to retrieve the most relevant chunks and put them to use in your prompt. The snippet below ties everything together: from forming the query, to constructing the prompt, and finally getting the answer from your Large Language Model.
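A sketch of that flow, reusing the helpers above plus a get_llm_response wrapper (sketched after the step-by-step list below), might read as follows; the query wording is an assumed example:

```python
# Formulate the user's query.
query = "What are the most recent breakthroughs in renewable energy and healthcare?"

# Retrieve the top five chunks that are semantically closest to the query.
top_chunks = retrieve_top_chunks(query, collection, top_k=5)

# Build a context-focused prompt from the question and the retrieved chunks.
prompt = build_prompt(query, top_chunks)

# Ask the LLM for an answer grounded in that context.
answer = get_llm_response(prompt)

# Printing both makes it easy to debug and refine retrieval and prompt design.
print("PROMPT:\n", prompt)
print("\nLLM ANSWER:\n", answer)
```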
Here's what's happening step by step:
- Formulating the Query: We define a query string that reflects the user's question or information request.
- Retrieving Chunks: Using retrieve_top_chunks, you get the top five chunks that closely match the query based on semantic similarity.
- Prompt Construction: The function build_prompt takes the user's question and the retrieved chunks to assemble a cohesive prompt.
- LLM Response: Finally, get_llm_response (sketched below) is called with the constructed prompt, prompting the model to generate a context-informed answer.
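The get_llm_response helper simply forwards the prompt to whichever LLM you have access to. A sketch using the OpenAI Python client is shown here; the model name is an assumption, so substitute your own provider or model as needed:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_llm_response(prompt, model="gpt-4o-mini"):
    """Send the assembled prompt to a chat model and return its text answer."""
    response = openai_client.chat.completions.create(
        model=model,  # assumed model name; any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```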
By printing both the prompt and the answer, you can debug, refine, and further tailor your approach to retrieval and prompt design.
Below is an example of the system's final output after retrieving the most relevant chunks and assembling them into a prompt:
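Since the exact output depends on your corpus and model, the block below is only an illustrative sketch of the shape of that output, with placeholder chunk texts and an abbreviated answer:

```
Question: What are the most recent breakthroughs in renewable energy and healthcare?

Answer the question using ONLY the context below. If the context does not contain the answer, say so.

Context:
1. <chunk about recent breakthroughs in renewable energy>
2. <chunk about healthcare innovations>
3. <chunk about newly developed sustainable materials>
4. <chunk about the Industrial Revolution>

LLM Answer:
Recent breakthroughs include more efficient renewable energy technologies, notable healthcare innovations, and the development of new sustainable materials.
```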
In this snippet, the prompt clearly instructs the LLM to focus on the listed chunks. By doing so, the final LLM Answer highlights the key points about recent breakthroughs in renewable energy, healthcare innovations, and sustainable materials, reflecting the relevance of the context. Interestingly, the chunk referencing the Industrial Revolution is not directly invoked in the final answer, showcasing the LLM's ability to select and incorporate only the most suitable context. Notice how each retrieved chunk contributes to a coherent, context-based response, demonstrating how RAG systems help reduce hallucinations and maintain factual alignment.
In this lesson, you discovered how to:
- Retrieve the most relevant text chunks in your vector database through semantic similarity.
- Construct a well-structured prompt so the LLM stays true to the provided text.
These steps are central to building a robust Retrieval-Augmented Generation pipeline. By creating focused, context-driven prompts, your LLM's responses tend to be more accurate and trustworthy.
Next, you'll have the opportunity to practice and solidify this knowledge. Look for the exercises that follow to test retrieving chunks with different queries, adjusting the prompt format, and experimenting with how the LLM responds. Keep pushing those boundaries — your mastery of RAG systems is well underway!
