Introduction

Welcome to the first lesson of the "Beyond Basic RAG: Improving our Pipeline" course, part of the "Foundations of RAG Systems" course path! In previous courses, you delved into the basics of Retrieval-Augmented Generation (RAG), exploring text representation with a focus on embeddings and vector databases. In this course, we'll embark on an exciting journey to enhance our RAG systems with advanced techniques. Our focus in this initial lesson is on constrained generation, a powerful method to ensure that language model responses remain anchored in the retrieved context, avoiding speculation or unrelated content. Get ready to elevate your RAG skills and build more reliable systems!

Theoretical Foundations of Constrained Generation

When employing large language models (LLMs) in real-world applications, accuracy and fidelity to a trusted dataset are paramount. Even advanced LLMs can produce incorrect or fabricated information — often termed “hallucinations.” This is where constrained generation becomes indispensable. In essence, it is a form of advanced prompt engineering: we carefully craft instructions so the LLM only responds using the retrieved information, or provides disclaimers when insufficient data is found.

By shaping the prompt and enforcing rule-based fallback mechanisms, we instruct the LLM to:

  • Use only the data you supply (the “retrieved context”).
  • Provide disclaimers or refusal messages when context is insufficient.
  • Optionally cite which part of the content it used.

The result is a system less prone to made-up facts and more consistent with the original knowledge source.

Why Constrained Generation Is Important

LLM hallucination can be quite misleading. Imagine a scenario where your application confidently presents policies or regulations not present in your knowledge base. This can create confusion or even compliance issues. With constrained generation:

  • The model remains grounded in the retrieved context only.
  • Uncertain or unavailable information triggers a fallback message like “No sufficient data.”
  • You can require the model to cite lines to verify the source of the answer, building trust with users.

Defining the Constrained Generation Function

We'll start by defining a function that enforces these constraints:

Python
def generate_with_constraints(query, retrieved_context, strategy="base"):
    """
    Thoroughly enforce model reliance on 'retrieved_context' when answering 'query.'

    The 'strategy' parameter allows for different prompt template variations:
      1) Base approach: Provide context, instruct LLM not to use outside info.
      2) Strict approach: Provide context with explicit disclaimers if the answer is not found.
      3) Citation approach: Provide context, then request the LLM to cite the relevant lines.

    Robust fallback:
      - If 'retrieved_context' is empty, respond with an apology or neutral statement.
      - Optionally log each stage for debugging or performance analysis.
    """
    # Provide a safe fallback if no context is retrieved
    if not retrieved_context.strip():
        return ("I'm sorry, but I couldn't find any relevant information.", "No context used.")

    # Choose a prompt template based on strategy
    if strategy == "base":
        # Base approach: instruct to use the context and not rely on external info
        prompt = (
            "Use the following context to answer the question in a concise manner.\n\n"
            f"Context:\n{retrieved_context}\n"
            f"Question: '{query}'\n"
            "Answer:"
        )
    elif strategy == "strict":
        # Strict approach: explicitly disallow info beyond the provided context
        prompt = (
            "You must ONLY use the context provided below. If you cannot find the answer in the context, say: 'No sufficient data'.\n"
            "Do not provide any information not found in the context.\n\n"
            f"Context:\n{retrieved_context}\n"
            f"Question: '{query}'\n"
            "Answer:"
        )
    elif strategy == "cite":
        # Citation approach: require references to lines used
        prompt = (
            "Answer strictly from the provided context, and list the lines you used as evidence with 'Cited lines:'.\n"
            "If the context does not contain the information, respond with: 'Not available in the retrieved texts.'\n\n"
            f"Provided context (label lines as needed):\n{retrieved_context}\n"
            f"Question: '{query}'\n"
            "Answer:"
        )
    # ...

Here's how it works:

  1. If no context was retrieved, the function immediately returns a fallback response.
  2. Different strategies (base, strict, cite) each construct a slightly different prompt. This lets you control how rigidly the model relies on the retrieved context:
    • Base Approach: This strategy provides the retrieved context and instructs the LLM not to use any external information. It is a straightforward method that ensures the model focuses on the given context but allows for some flexibility in interpretation.
    • Strict Approach: This strategy explicitly disallows the use of any information beyond the provided context. If the answer cannot be found within the context, the model is instructed to respond with "No sufficient data." This approach is ideal for scenarios where accuracy and adherence to the provided information are critical.
    • Citation Approach: This strategy requires the model to answer strictly from the provided context and to list the lines used as evidence with "Cited lines:". If the context does not contain the necessary information, the model responds with "Not available in the retrieved texts." This approach is useful for applications where transparency and traceability of the information source are important.

Generating the Final Response

After the prompt is constructed, it is sent to the LLM and the response is parsed. Notice below how we split the text at "Cited lines:": if the marker is present, we separate the answer from the cited lines; if not, the whole response is treated as the answer:

Python
    # ...
    response = get_llm_response(prompt)

    # Attempt to parse out 'Cited lines:' if present
    segments = response.split("Cited lines:")
    if len(segments) == 2:
        answer_part, used_context_part = segments
        return answer_part.strip(), used_context_part.strip()
    else:
        return response.strip(), "No explicit lines cited."

Let's break down the steps:

  1. The prompt is passed to a helper function, get_llm_response, which queries the language model (a minimal sketch of such a helper follows this list).
  2. The returned text is scanned for the marker "Cited lines:". If found, the text before it is treated as the main answer and the remainder as the cited lines; otherwise, the entire response is returned with a note that no lines were explicitly cited.
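
The get_llm_response helper isn't defined in this lesson. Below is a minimal sketch of what it might look like using the OpenAI Python client; the client library, model name, and temperature are assumptions, and any chat-completion API would work just as well.

Python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def get_llm_response(prompt):
    """Illustrative helper: send the constructed prompt to a chat model and return its text reply."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; use whichever model your environment provides
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # a low temperature keeps answers close to the provided context
    )
    return completion.choices[0].message.content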

Demonstration of Retrieval and Constrained Generation

Below is a typical end-to-end scenario: documents are loaded, chunked, and stored in a vector database, and a query then retrieves the most relevant chunks. The indexing steps are shown only briefly:

Python
# Example demonstration of retrieval and constrained generation

# 1. Load and chunk a corpus
chunked_docs = load_and_chunk_corpus("data/corpus.json")

# 2. Build a collection in a vector database
collection = build_chroma_collection(chunked_docs, collection_name="corpus_collection")

# 3. Run a sample query
query = "Highlight the main policies that apply to employees."
retrieval_results = collection.query(query_texts=[query], n_results=2)

# 4. Construct the retrieved context from top matches
if not retrieval_results['documents'][0]:
    retrieved_context = ""
else:
    # Join the relevant chunks into one convenient string
    retrieved_context = "\n".join(["- " + doc_text for doc_text in retrieval_results['documents'][0]])

# 5. Execute constrained generation function based on chosen strategy
strategy = "strict"
answer, used_context = generate_with_constraints(query, retrieved_context, strategy=strategy)

Under the hood:

  1. We load the corpus, build a vector collection, and issue a query (illustrative sketches of these helper functions follow this list).
  2. The top two documents are retrieved, combined, and passed to the constrained generation function.
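
The helpers load_and_chunk_corpus and build_chroma_collection come from the earlier retrieval lessons. If you need stand-ins, the following sketch shows roughly what they might look like with chromadb; the JSON layout, chunk size, and reliance on Chroma's default embedding function are assumptions.

Python
import json
import chromadb

def load_and_chunk_corpus(path, chunk_size=500):
    """Illustrative stand-in: read a JSON list of documents and split each text into fixed-size chunks."""
    with open(path) as f:
        docs = json.load(f)  # assumed format: a list of {"text": ...} records
    chunks = []
    for doc in docs:
        text = doc["text"]
        chunks.extend(text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    return chunks

def build_chroma_collection(chunked_docs, collection_name):
    """Illustrative stand-in: index the chunks in a Chroma collection using its default embedding function."""
    client = chromadb.Client()
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(
        documents=chunked_docs,
        ids=[f"chunk-{i}" for i in range(len(chunked_docs))],
    )
    return collection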

Practical Example: A Policy FAQ Bot

Consider an HR FAQ bot with access to internal policy documents. When employees ask about vacation rules, the bot retrieves the relevant sections from the knowledge base and delivers accurate answers. If a topic isn't documented, it responds with "No sufficient data," making clear that only verified context is used. In scenarios requiring transparency, such as citing specific policy lines, the bot includes references with each response to build trust and clarity. Wiring the constrained generation function into the bot's workflow means every response is generated from retrieved context alone, so the bot avoids hallucinations and stays grounded in the organization's official policies.
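
A rough sketch of how such a bot could tie these pieces together, reusing the collection and generate_with_constraints from earlier (the wrapper name and sample question are illustrative, not part of the lesson's code):

Python
def answer_policy_question(question, collection, strategy="cite"):
    """Illustrative FAQ-bot wrapper: retrieve policy chunks, then answer under constraints."""
    results = collection.query(query_texts=[question], n_results=2)
    docs = results["documents"][0]
    retrieved_context = "\n".join("- " + d for d in docs) if docs else ""
    return generate_with_constraints(question, retrieved_context, strategy=strategy)

# Example call for a vacation-policy question
answer, cited_lines = answer_policy_question("How many vacation days do new employees get?", collection)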

Conclusion and Next Steps

Constrained generation is an essential technique for keeping a RAG system tightly bound to authentic sources. By tailoring prompt instructions and incorporating fallback logic, you reduce the risk of misinformation and ensure answers stay grounded in your retrieved documents.

Next Steps:

  • Experiment with different prompt styles and strategies to tailor the level of strictness or citation detail.
  • Evaluate the behavior of your system by deliberately omitting key context and observing whether it provides the correct fallback responses (a quick check is sketched after this list).
  • Integrate these strategies into broader real-world scenarios and see how well the system maintains accuracy under various user requests.
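
For example, passing an empty context should trigger the fallback branch of generate_with_constraints:

Python
# Quick check of the fallback path: no retrieved context at all
answer, used_context = generate_with_constraints(
    "What is the parental leave policy?", "", strategy="strict"
)
print(answer)        # -> "I'm sorry, but I couldn't find any relevant information."
print(used_context)  # -> "No context used."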