Welcome to the third lesson in our journey through "Scaling Up RAG with Vector Databases"! Well done, you're halfway through this course. In the previous lesson, you learned how to split or chunk your text data and store those chunks in a vector database collection. Now, we'll delve into retrieving the most relevant chunks for any given query and building an LLM prompt to produce more accurate, context-driven answers.
Let's start by understanding how to fetch the most relevant information from your vector database, which is the foundation for effective retrieval-augmented generation.
Before your LLM can generate a coherent, context-rich answer, you need to fetch the right information. Your vector database (for instance, using Chroma) will rank which document chunks are most relevant for a given query. Let's explore how this is achieved with the `retrieve_top_chunks` function.
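The full implementation isn't reproduced here, so the sketch below shows one way `retrieve_top_chunks` might look. It assumes the `SentenceEmbedder` from the previous lesson (with an `embed` method), a `RetrievedChunk` struct of our own, and a synchronous ChromaDB client whose `QueryOptions` fields match the ones described next; exact module paths, field names, and method signatures may differ in your client version.

```rust
// Sketch only: paths, field names, and signatures below follow the lesson's
// description; check your chromadb crate version for the exact API.
use anyhow::Result;
use chromadb::v1::collection::{ChromaCollection, QueryOptions};

use crate::embeddings::SentenceEmbedder; // assumed module for the embedder from the previous lesson

/// One retrieved chunk: its text, an index-based ID, and the distance score
/// reported by the vector database.
pub struct RetrievedChunk {
    pub id: usize,
    pub text: String,
    pub distance: f32,
}

pub fn retrieve_top_chunks(
    collection: &ChromaCollection,
    embedder: &SentenceEmbedder,
    query_text: &str,
    top_k: usize,
) -> Result<Vec<RetrievedChunk>> {
    // 1. Embed the query so it can be compared against the stored chunk embeddings.
    let query_embedding: Vec<f32> = embedder.embed(query_text)?; // `embed` is an assumed method name

    // 2. Ask for the top_k closest chunks, including document texts and distances.
    let options = QueryOptions {
        query_embeddings: Some(vec![query_embedding]),
        n_results: Some(top_k),
        include: Some(vec!["documents", "distances"]),
        ..Default::default() // remaining filter fields left unset
    };
    let results = collection.query(options)?;

    // 3. Early return if no documents came back.
    let docs = match results.documents {
        Some(mut docs) if !docs.is_empty() => docs.remove(0), // results for our single query
        _ => return Ok(Vec::new()),
    };
    let dists = match results.distances {
        Some(mut dists) if !dists.is_empty() => dists.remove(0),
        _ => vec![0.0; docs.len()],
    };

    // 4. Pair each document with its distance and an index-based ID.
    Ok(docs
        .into_iter()
        .zip(dists)
        .enumerate()
        .map(|(id, (text, distance))| RetrievedChunk { id, text, distance })
        .collect())
}
```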
In this initial part of the function, we use the `SentenceEmbedder` to convert the query text into a dense vector representation. This embedding is crucial, as it allows us to perform a similarity search in the vector space, comparing the query against the stored document embeddings.
Here, we construct the `QueryOptions` struct, specifying that we want to use the query embeddings for the search. We also set `n_results` to `top_k`, indicating the number of top results we wish to retrieve. The `include` field specifies that we want both the document texts and their distances (similarity scores) in the results. The `query` method of the collection is then called with these options, returning the most relevant document chunks. We also handle the case where no documents are returned by checking whether the `documents` field is `None` or empty, allowing for an early return.
In this final section, we process the query results. We iterate over the documents and their corresponding distances, creating a `RetrievedChunk` struct for each document that contains the document text, an index-based document ID, and the distance. These chunks are collected into a vector, which is returned as the function's result.
This function is essential for fetching the most relevant information for your query, ensuring that the LLM has the right context to generate accurate and context-driven answers.
Now that you know how to retrieve the best-matching chunks, let's see how to use them to build a prompt that guides your LLM to generate focused and reliable answers.
Once you have your relevant chunks, the next step is constructing a prompt that keeps the LLM focused on precisely those chunks. This helps maintain factual accuracy and a tight context. We'll add this function in `llm.rs`.
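A minimal sketch of what `build_prompt` might look like, assuming the `RetrievedChunk` type from the retrieval sketch above; the exact wording and layout of the instructions are yours to tune:

```rust
use crate::retrieval::RetrievedChunk; // assumed module path for the struct sketched earlier

/// Assemble a prompt that keeps the LLM focused on the retrieved context.
pub fn build_prompt(question: &str, chunks: &[RetrievedChunk]) -> String {
    let mut prompt = format!("Question: {question}\n\n");
    prompt.push_str("Answer the question using ONLY the context below. ");
    prompt.push_str("If the context does not contain the answer, say so.\n\nContext:\n");
    for chunk in chunks {
        prompt.push_str(&format!(
            "- (chunk {}, distance {:.3}) {}\n",
            chunk.id, chunk.distance, chunk.text
        ));
    }
    prompt.push_str("\nAnswer:");
    prompt
}
```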
Why is this important?
- Controlled Context: By explicitly instructing the LLM to focus on the given context, you reduce the probability of hallucinations.
- Flexibility: You can modify the prompt format — like adding bullet points or rewording instructions — to direct the LLM's style or depth of response.
- Clarity: Including the question upfront reminds the model of the exact query it must address.
We'll be seeing an actual prompt example later in the lesson!
With your prompt-building strategy in place, the next step is to ensure your data is loaded and your vector database collection is ready for retrieval.
To see this in action, you'll first need to load your corpus data and create a collection in your vector database. This ensures your text chunks are accessible for the retrieval process.
Below is an example of how you might load documents from a JSON file, initialize an embedding model, and create (or retrieve) a collection in your chosen vector database:
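The course's own setup code isn't reproduced here; the sketch below assumes a `corpus.json` file of pre-chunked passages, the `SentenceEmbedder` from earlier lessons, and a synchronous chromadb client. The file path, JSON schema, and exact client calls (`get_or_create_collection`, `CollectionEntries`, `add`) are assumptions you may need to adjust for your setup.

```rust
// Sketch only: JSON schema, module paths, and client call signatures are assumptions.
use anyhow::Result;
use chromadb::v1::{
    collection::{ChromaCollection, CollectionEntries},
    ChromaClient,
};
use serde::Deserialize;

use crate::embeddings::SentenceEmbedder; // assumed module for the embedder from earlier lessons

#[derive(Deserialize)]
struct CorpusDoc {
    text: String, // assumed field name holding each pre-chunked passage
}

/// Load pre-chunked documents from a JSON array on disk.
fn load_corpus(path: &str) -> Result<Vec<String>> {
    let raw = std::fs::read_to_string(path)?;
    let docs: Vec<CorpusDoc> = serde_json::from_str(&raw)?;
    Ok(docs.into_iter().map(|d| d.text).collect())
}

/// Create (or fetch) the collection and ingest every chunk in a single batch.
fn build_chroma_collection(
    client: &ChromaClient,
    embedder: &SentenceEmbedder,
    chunks: &[String],
) -> Result<ChromaCollection> {
    let collection = client.get_or_create_collection("full_document_collection", None)?;

    // Embed all chunks up front so they can be added in one batched call.
    let embeddings: Vec<Vec<f32>> = chunks
        .iter()
        .map(|c| embedder.embed(c))
        .collect::<Result<_>>()?;
    let ids: Vec<String> = (0..chunks.len()).map(|i| format!("chunk-{i}")).collect();

    collection.add(CollectionEntries {
        ids: ids.iter().map(String::as_str).collect(),
        embeddings: Some(embeddings),
        documents: Some(chunks.iter().map(String::as_str).collect()),
        metadatas: None,
    })?;

    Ok(collection)
}
```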
Here are the key parts to remember:
- Embedding Function: The `build_chroma_collection` function uses the provided `embedder` (a `SentenceEmbedder` instance) to generate dense vector representations for each document chunk.
- Collection Creation: `build_chroma_collection` either creates a new ChromaDB collection or retrieves an existing one named `"full_document_collection"`.
- Batch Ingestion: The function ingests all document chunks into the collection in a single batch for efficiency.
With your collection set up and ready, let's move on to the process of querying your database and generating answers using your LLM.
With your collection in place, it's time to retrieve the most relevant chunks and put them to use in your prompt. The snippet below ties everything together: from defining the query, to constructing the prompt, and finally getting the answer from your Large Language Model.
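The lesson's exact driver code isn't reproduced here; a sketch along these lines ties the earlier pieces together, where the query text, constructor names, file path, and client options are illustrative assumptions:

```rust
use anyhow::Result;
use chromadb::v1::ChromaClient;

// The helpers below were sketched earlier in this lesson; module paths are assumptions.
use crate::embeddings::SentenceEmbedder;
use crate::ingest::{build_chroma_collection, load_corpus};
use crate::llm::{build_prompt, get_llm_response};
use crate::retrieval::retrieve_top_chunks;

fn main() -> Result<()> {
    // Set up the embedder, client, corpus, and collection.
    let embedder = SentenceEmbedder::new()?;             // constructor name is an assumption
    let client = ChromaClient::new(Default::default());  // connection options depend on your setup
    let chunks = load_corpus("data/corpus.json")?;       // illustrative path
    let collection = build_chroma_collection(&client, &embedder, &chunks)?;

    // 1. Formulate the query.
    let query = "What recent breakthroughs do these documents describe?";

    // 2. Retrieve the top three chunks by semantic similarity.
    let top_chunks = retrieve_top_chunks(&collection, &embedder, query, 3)?;

    // 3. Build a context-focused prompt from the question and the chunks.
    let prompt = build_prompt(query, &top_chunks);
    println!("--- Prompt ---\n{prompt}\n");

    // 4. Send the prompt to the LLM and print the answer.
    let answer = get_llm_response(&prompt)?;
    println!("--- LLM Answer ---\n{answer}");

    Ok(())
}
```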
Here's what's happening step by step:
- Formulating the Query: We define a query string that reflects the user's question or information request.
- Retrieving Chunks: Using `retrieve_top_chunks`, you get the top three chunks that most closely match the query based on semantic similarity.
- Prompt Construction: The `build_prompt` function takes the user's question and the retrieved chunks to assemble a cohesive prompt.
- LLM Response: Finally, `get_llm_response` is invoked with the constructed prompt to produce an answer.
By printing both the prompt and the answer, you can debug, refine, and further tailor your approach to retrieval and prompt design.
Now, let's take a closer look at what the output actually looks like and how the retrieved context shapes the LLM's response.
Below is an example of the system's final output after retrieving the most relevant chunks and assembling them into a prompt:
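The exact text depends on your corpus, embedding model, and LLM. An illustrative sketch of the shape of that output, where the chunk contents, distances, query, and answer are paraphrased placeholders rather than a real run, might look like this:

```text
--- Prompt ---
Question: What recent breakthroughs do these documents describe?

Answer the question using ONLY the context below. If the context does not contain the answer, say so.

Context:
- (chunk 0, distance 0.41) ...recent breakthroughs in renewable energy, including cheaper solar storage...
- (chunk 1, distance 0.44) ...healthcare innovations and new sustainable materials entering production...
- (chunk 2, distance 0.52) ...the Industrial Revolution transformed manufacturing in the 19th century...

Answer:

--- LLM Answer ---
The documents highlight recent breakthroughs in renewable energy, notable healthcare
innovations, and progress in sustainable materials.
```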
In this snippet, the prompt explicitly instructs the LLM to focus on the listed chunks. As a result, the final LLM answer highlights the key points about recent breakthroughs in renewable energy, healthcare innovations, and sustainable materials, reflecting the relevance of the retrieved context. Interestingly, the chunk referencing the Industrial Revolution is not directly reflected in the final answer, showing the LLM's ability to draw on only the most suitable parts of the context. Notice how the relevant chunks combine into a coherent, context-grounded response, demonstrating how RAG systems help reduce hallucinations and maintain factual alignment.
Having seen the end-to-end process and its output, let's summarize what you've learned and look ahead to the next steps.
In this lesson, you discovered how to:
- Retrieve the most relevant text chunks from your vector database through semantic similarity.
- Construct a well-structured prompt so the LLM stays true to the provided text.
These steps are central to building a robust Retrieval-Augmented Generation pipeline. By creating focused, context-driven prompts, your LLM's responses tend to be more accurate and trustworthy.
Next, you'll have the opportunity to practice and solidify this knowledge. Look for the exercises that follow to test retrieving chunks with different queries, adjusting the prompt format, and experimenting with how the LLM responds. Keep pushing those boundaries — your mastery of RAG systems is well underway!
