Introduction

Welcome to the first lesson of the "Beyond Basic RAG: Improving our Pipeline" course, part of the "Foundations of RAG Systems" course path! In previous courses, you delved into the basics of Retrieval-Augmented Generation (RAG), exploring text representation with a focus on embeddings and vector databases. In this course, we'll embark on an exciting journey to enhance our RAG systems with advanced techniques. Our focus in this initial lesson is on constrained generation, a powerful method to ensure that language model responses remain anchored in the retrieved context, avoiding speculation or unrelated content. Get ready to elevate your RAG skills and build more reliable systems!

Theoretical Foundations of Constrained Generation

When employing large language models (LLMs) in real-world applications, accuracy and fidelity to a trusted dataset are paramount. Even advanced LLMs can produce incorrect or fabricated information — often termed “hallucinations.” This is where constrained generation becomes indispensable. In essence, it is a form of advanced prompt engineering: we carefully craft instructions so the LLM only responds using the retrieved information, or provides disclaimers when insufficient data is found.

By shaping the prompt and enforcing rule-based fallback mechanisms, we instruct the LLM to:

  • Use only the data you supply (the “retrieved context”).
  • Provide disclaimers or refusal messages when context is insufficient.
  • Optionally cite which part of the content it used.

The result is a system less prone to made-up facts and more consistent with the original knowledge source.

Why Constrained Generation Is Important

LLM hallucination can be quite misleading. Imagine a scenario where your application confidently presents policies or regulations not present in your knowledge base. This can create confusion or even compliance issues. With constrained generation:

  • The model remains grounded in the retrieved context only.
  • Uncertain or unavailable information triggers a fallback message like “No sufficient data.”
  • You can require the model to cite lines to verify the source of the answer, building trust with users.

Defining the Constrained Generation Function

We'll start by defining a function that enforces these constraints:

Python
def generate_with_constraints(query, retrieved_context, strategy="base"):
    """
    Thoroughly enforce model reliance on 'retrieved_context' when answering 'query.'

    The 'strategy' parameter allows for different prompt template variations:
      1) Base approach: Provide context, instruct LLM not to use outside info.
      2) Strict approach: Provide context with explicit disclaimers if the answer is not found.
      3) Citation approach: Provide context, then request the LLM to cite the relevant lines.

    Robust fallback:
      - If 'retrieved_context' is empty, respond with an apology or neutral statement.
      - Optionally log each stage for debugging or performance analysis.
    """
    # Provide a safe fallback if no context is retrieved
    if not retrieved_context.strip():
        return ("I'm sorry, but I couldn't find any relevant information.", "No context used.")

    # Choose a prompt template based on strategy
    if strategy == "base":
        # Base approach: instruct to use the context and not rely on external info
        prompt = (
            "Use the following context to answer the question in a concise manner.\n\n"
            f"Context:\n{retrieved_context}\n"
            f"Question: '{query}'\n"
            "Answer:"
        )
    elif strategy == "strict":
        # Strict approach: explicitly disallow info beyond the provided context
        prompt = (
            "You must ONLY use the context provided below. If you cannot find the answer in the context, say: 'No sufficient data'.\n"
            "Do not provide any information not found in the context.\n\n"
            f"Context:\n{retrieved_context}\n"
            f"Question: '{query}'\n"
            "Answer:"
        )
    elif strategy == "cite":
        # Citation approach: require references to lines used
        prompt = (
            "Answer strictly from the provided context, and list the lines you used as evidence with 'Cited lines:'.\n"
            "If the context does not contain the information, respond with: 'Not available in the retrieved texts.'\n\n"
            f"Provided context (label lines as needed):\n{retrieved_context}\n"
            f"Question: '{query}'\n"
            "Answer:"
        )
    # ...

Here's how it works:

  1. If no context was retrieved, the function immediately returns a fallback response.
  2. Different strategies (base, strict, cite) each construct a slightly different prompt. This lets you control how rigidly the model relies on the retrieved context:
    • Base Approach: This strategy provides the retrieved context and instructs the LLM not to use any external information. It is a straightforward method that ensures the model focuses on the given context but allows for some flexibility in interpretation.
    • Strict Approach: This strategy explicitly disallows the use of any information beyond the provided context. If the answer cannot be found within the context, the model is instructed to respond with "No sufficient data." This approach is ideal for scenarios where accuracy and adherence to the provided information are critical.
    • Citation Approach: This strategy requires the model to answer strictly from the provided context and to list the lines used as evidence with "Cited lines:". If the context does not contain the necessary information, the model responds with "Not available in the retrieved texts." This approach is useful for applications where transparency and traceability of the information source are important.

Generating the Final Response

After the prompt is constructed, it is sent to the LLM and the response is parsed. Notice below how we split the text at "Cited lines:": if the marker is present, we separate the answer from the cited lines; if not, the whole response is treated as the answer:

Python
    # ...
    response = get_llm_response(prompt)

    # Attempt to parse out 'Cited lines:' if present
    segments = response.split("Cited lines:")
    if len(segments) == 2:
        answer_part, used_context_part = segments
        return answer_part.strip(), used_context_part.strip()
    else:
        return response.strip(), "No explicit lines cited."

Let's break down the steps:

  1. The prompt is passed to a helper function, get_llm_response, which queries the language model (a minimal sketch of such a helper follows this list).
  2. The returned text is scanned for the marker "Cited lines:". If found, the text before it is treated as the main answer and the remainder as the cited lines; otherwise, the entire response is returned with a note that no lines were explicitly cited.
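
The get_llm_response helper isn't defined in this lesson. Below is a minimal sketch of what it might look like using the OpenAI Python client; the client library, model name, and temperature are assumptions, and any chat-completion API would work just as well.

Python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def get_llm_response(prompt):
    """Illustrative helper: send the constructed prompt to a chat model and return its text reply."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; use whichever model your environment provides
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # a low temperature keeps answers close to the provided context
    )
    return completion.choices[0].message.content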

Demonstration of Retrieval and Constrained Generation

Below is a typical end-to-end scenario: documents are loaded, chunked, and stored in a vector database, and a query then retrieves the most relevant chunks. The indexing steps are shown only briefly:

Python
# Example demonstration of retrieval and constrained generation

# 1. Load and chunk a corpus
chunked_docs = load_and_chunk_corpus("data/corpus.json")

# 2. Build a collection in a vector database
collection = build_chroma_collection(chunked_docs, collection_name="corpus_collection")

# 3. Run a sample query
query = "Highlight the main policies that apply to employees."
retrieval_results = collection.query(query_texts=[query], n_results=2)

# 4. Construct the retrieved context from top matches
if not retrieval_results['documents'][0]:
    retrieved_context = ""
else:
    # Join the relevant chunks into one convenient string
    retrieved_context = "\n".join(["- " + doc_text for doc_text in retrieval_results['documents'][0]])

# 5. Execute constrained generation function based on chosen strategy
strategy = "strict"
answer, used_context = generate_with_constraints(query, retrieved_context, strategy=strategy)

Under the hood:

  1. We load the corpus, build a vector collection, and issue a query (illustrative sketches of these helper functions follow this list).
  2. The top two documents are retrieved, combined, and passed to the constrained generation function.
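
The helpers load_and_chunk_corpus and build_chroma_collection come from the earlier retrieval lessons. If you need stand-ins, the following sketch shows roughly what they might look like with chromadb; the JSON layout, chunk size, and reliance on Chroma's default embedding function are assumptions.

Python
import json
import chromadb

def load_and_chunk_corpus(path, chunk_size=500):
    """Illustrative stand-in: read a JSON list of documents and split each text into fixed-size chunks."""
    with open(path) as f:
        docs = json.load(f)  # assumed format: a list of {"text": ...} records
    chunks = []
    for doc in docs:
        text = doc["text"]
        chunks.extend(text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    return chunks

def build_chroma_collection(chunked_docs, collection_name):
    """Illustrative stand-in: index the chunks in a Chroma collection using its default embedding function."""
    client = chromadb.Client()
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(
        documents=chunked_docs,
        ids=[f"chunk-{i}" for i in range(len(chunked_docs))],
    )
    return collection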

Practical Example: A Policy FAQ Bot

Consider an HR FAQ bot with access to internal policy documents. When employees ask about vacation rules, the bot retrieves the relevant sections from the knowledge base and delivers accurate answers. If a topic isn't documented, it responds with "No sufficient data," making clear that only verified context is used. In scenarios requiring transparency, such as citing specific policy lines, the bot includes references with each response to build trust and clarity. Wiring the constrained generation function into the bot's workflow means every response is generated from retrieved context alone, so the bot avoids hallucinations and stays grounded in the organization's official policies.
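
A rough sketch of how such a bot could tie these pieces together, reusing the collection and generate_with_constraints from earlier (the wrapper name and sample question are illustrative, not part of the lesson's code):

Python
def answer_policy_question(question, collection, strategy="cite"):
    """Illustrative FAQ-bot wrapper: retrieve policy chunks, then answer under constraints."""
    results = collection.query(query_texts=[question], n_results=2)
    docs = results["documents"][0]
    retrieved_context = "\n".join("- " + d for d in docs) if docs else ""
    return generate_with_constraints(question, retrieved_context, strategy=strategy)

# Example call for a vacation-policy question
answer, cited_lines = answer_policy_question("How many vacation days do new employees get?", collection)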

Conclusion and Next Steps

Constrained generation is an essential technique for keeping a RAG system tightly bound to authentic sources. By tailoring prompt instructions and incorporating fallback logic, you reduce the risk of misinformation and ensure answers stay grounded in your retrieved documents.

Next Steps:

  • Experiment with different prompt styles and strategies to tailor the level of strictness or citation detail.
  • Evaluate the behavior of your system by deliberately omitting key context and observing whether it provides the correct fallback responses (a quick check is sketched after this list).
  • Integrate these strategies into broader real-world scenarios and see how well the system maintains accuracy under various user requests.
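
For example, passing an empty context should trigger the fallback branch of generate_with_constraints:

Python
# Quick check of the fallback path: no retrieved context at all
answer, used_context = generate_with_constraints(
    "What is the parental leave policy?", "", strategy="strict"
)
print(answer)        # -> "I'm sorry, but I couldn't find any relevant information."
print(used_context)  # -> "No context used."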