Welcome to the third lesson in our journey through "Scaling Up RAG with Vector Databases"! Well done, you're halfway through this course. In the previous lesson, you learned how to split or chunk your text data and store those chunks in a vector database collection. Now, we'll delve into retrieving the most relevant chunks for any given query and building an LLM prompt to produce more accurate, context-driven answers.
Before your LLM can generate a coherent, context-rich answer, you need to fetch the right information. Your vector database (for instance, using Chroma) will rank which document chunks are most relevant for a given query.
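The breakdown below refers to a small retrieval helper. A minimal sketch of such a function, assuming a Chroma collection queried through the standard chromadb client API, might look like this:

```python
def retrieve_top_chunks(query, collection, top_k=2):
    """Return the top_k chunks in the collection that are most similar to the query."""
    # Chroma expects query_texts to be a list, even for a single query.
    results = collection.query(query_texts=[query], n_results=top_k)

    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],      # the text content of the chunk
            "doc_id": results['ids'][0][i],           # the document identifier
            "distance": results['distances'][0][i],   # lower distance = closer semantic match
        })
    return retrieved_chunks
```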
Let's break down this code in detail:
- Function Definition: retrieve_top_chunks takes three parameters:
  - query: The user's question or search term.
  - collection: The Chroma collection object containing our embedded documents.
  - top_k: The number of most relevant chunks to retrieve (default is 2).
- Vector Search: The collection.query() function performs a vector-based similarity search to pinpoint which chunks are most aligned with the query.
  - query_texts=[query] passes the user's query as a list (Chroma's API expects a list).
  - n_results=top_k specifies how many matching chunks to return.
- Results Structure: The query returns a dictionary with multiple keys:
  - 'documents': Contains the actual text chunks.
  - 'ids': Contains the document identifiers.
  - 'distances': Each result includes a distance, which indicates how semantically close a chunk is to your query; the lower the distance, the better the match.
  - Each of these keys maps to a nested list structure: [[item1, item2, ...]].
- Processing Results: For each result, the function creates a dictionary with three key pieces of information:
  - "chunk": The actual text content from results['documents'][0][i].
  - "doc_id": The document identifier from results['ids'][0][i].
  - "distance": The similarity score from results['distances'][0][i].
  - These dictionaries are appended to the retrieved_chunks list, which is then returned.
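As a quick sanity check, you can call the helper and inspect what comes back; the query string and printed fields here are purely illustrative:

```python
chunks = retrieve_top_chunks("What are recent breakthroughs in renewable energy?", collection)

# Each element is a dictionary shaped like:
# {"chunk": "<matching text>", "doc_id": "<document id>", "distance": <float>}
for item in chunks:
    print(item["doc_id"], round(item["distance"], 3), item["chunk"][:80])
```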
Once you have your relevant chunks, the next step is constructing a prompt that ensures the LLM focuses on precisely those chunks. This helps maintain factual accuracy and a tight context.
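A minimal version of such a prompt builder, using the build_prompt name that appears later in this lesson and instruction wording you are free to adapt, could look like:

```python
def build_prompt(query, retrieved_chunks):
    """Assemble a prompt that keeps the LLM focused on the retrieved context."""
    # Number each chunk so the context pieces are easy to scan and reference.
    context = "\n".join(
        f"{i + 1}. {item['chunk']}" for i, item in enumerate(retrieved_chunks)
    )
    return (
        f"Question: {query}\n\n"
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
```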
Why is this important?
- Controlled Context: By explicitly instructing the LLM to focus on the given context, you reduce the probability of hallucinations.
- Flexibility: You can modify the prompt format — like adding bullet points or rewording instructions — to direct the LLM's style or depth of response.
- Clarity: Including the question upfront reminds the model of the exact query it must address.
We'll be seeing an actual prompt example later in the lesson!
To see this in action, you'll first need to load your corpus data and create a collection in your vector database. This ensures your text chunks are accessible for the retrieval process.
Below is an example of how you might load documents from a JSON file, initialize an embedding model, and create (or retrieve) a collection in your chosen vector database:
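One possible sketch of that setup uses Chroma with a Sentence Transformers embedding function; the corpus.json filename, its record fields ("id" and "text"), and the collection name are assumptions for illustration, so adapt them to your own data:

```python
import json

import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Load the chunked corpus produced in the previous lesson.
# Assumed structure: a list of {"id": ..., "text": ...} records.
with open("corpus.json", "r") as f:
    corpus = json.load(f)

# Initialize an embedding function; any Sentence Transformers model can be used here.
embedding_fn = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Create the collection if it doesn't exist yet, or reuse it if it does.
client = chromadb.Client()
collection = client.get_or_create_collection(
    name="documents",                 # assumed collection name
    embedding_function=embedding_fn,
)

# Bulk ingestion: add all chunks in a single call rather than one at a time.
collection.add(
    documents=[item["text"] for item in corpus],
    ids=[item["id"] for item in corpus],
)
```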
Key Details
- Embedding Function: Here, SentenceTransformerEmbeddingFunction is used for generating vector representations of your text. You can replace it with another embedding model suited to your needs.
- Collection: Instead of manually creating a new collection each time, get_or_create_collection either retrieves an existing one or initializes a fresh collection for you.
- Bulk Ingestion: By batching documents, you efficiently add multiple items to your vector database at once.
With your collection in place, it's time to retrieve the most relevant chunks and put them to use in your prompt. The snippet below ties everything together: from forming the query, to constructing the prompt, and finally getting the answer from your Large Language Model.
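A sketch of that flow, reusing the helpers above plus a get_llm_response wrapper (sketched after the step-by-step list below), might read as follows; the query wording is an assumed example:

```python
# Formulate the user's query.
query = "What are the most recent breakthroughs in renewable energy and healthcare?"

# Retrieve the top five chunks that are semantically closest to the query.
top_chunks = retrieve_top_chunks(query, collection, top_k=5)

# Build a context-focused prompt from the question and the retrieved chunks.
prompt = build_prompt(query, top_chunks)

# Ask the LLM for an answer grounded in that context.
answer = get_llm_response(prompt)

# Printing both makes it easy to debug and refine retrieval and prompt design.
print("PROMPT:\n", prompt)
print("\nLLM ANSWER:\n", answer)
```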
Here's what's happening step by step:
- Formulating the Query: We define a query string that reflects the user's question or information request.
- Retrieving Chunks: Using retrieve_top_chunks, you get the top five chunks that closely match the query based on semantic similarity.
- Prompt Construction: The function build_prompt takes the user's question and the retrieved chunks to assemble a cohesive prompt.
- LLM Response: Finally, get_llm_response (sketched below) is called with the constructed prompt, prompting the model to generate a context-informed answer.
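The get_llm_response helper simply forwards the prompt to whichever LLM you have access to. A sketch using the OpenAI Python client is shown here; the model name is an assumption, so substitute your own provider or model as needed:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_llm_response(prompt, model="gpt-4o-mini"):
    """Send the assembled prompt to a chat model and return its text answer."""
    response = openai_client.chat.completions.create(
        model=model,  # assumed model name; any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```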
By printing both the prompt and the answer, you can debug, refine, and further tailor your approach to retrieval and prompt design.
Below is an example of the system's final output after retrieving the most relevant chunks and assembling them into a prompt:
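Since the exact output depends on your corpus and model, the block below is only an illustrative sketch of the shape of that output, with placeholder chunk texts and an abbreviated answer:

```
Question: What are the most recent breakthroughs in renewable energy and healthcare?

Answer the question using ONLY the context below. If the context does not contain the answer, say so.

Context:
1. <chunk about recent breakthroughs in renewable energy>
2. <chunk about healthcare innovations>
3. <chunk about newly developed sustainable materials>
4. <chunk about the Industrial Revolution>

LLM Answer:
Recent breakthroughs include more efficient renewable energy technologies, notable healthcare innovations, and the development of new sustainable materials.
```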
In this snippet, the prompt clearly instructs the LLM to focus on the listed chunks. By doing so, the final LLM Answer highlights the key points about recent breakthroughs in renewable energy, healthcare innovations, and sustainable materials, reflecting the relevance of the context. Interestingly, the chunk referencing the Industrial Revolution is not directly invoked in the final answer, showcasing the LLM's ability to select and incorporate only the most suitable context. Notice how each retrieved chunk contributes to a coherent, context-based response, demonstrating how RAG systems help reduce hallucinations and maintain factual alignment.
In this lesson, you discovered how to:
- Retrieve the most relevant text chunks in your vector database through semantic similarity.
- Construct a well-structured prompt so the LLM stays true to the provided text.
These steps are central to building a robust Retrieval-Augmented Generation pipeline. By creating focused, context-driven prompts, your LLM's responses tend to be more accurate and trustworthy.
Next, you'll have the opportunity to practice and solidify this knowledge. Look for the exercises that follow to test retrieving chunks with different queries, adjusting the prompt format, and experimenting with how the LLM responds. Keep pushing those boundaries — your mastery of RAG systems is well underway!
