In the previous lessons, you’ve built a solid foundation for securing OpenAI agent workflows in TypeScript. You learned how to securely handle sensitive data using private context objects, monitor agent execution with event listeners, and protect against harmful inputs using input guardrails. Now, you’re ready to implement the final critical layer of your security framework: output guardrails.
While input guardrails protect your agents from problematic user requests, output guardrails serve as your last line of defense by validating what your agents actually generate before those responses reach end users. This is especially important in production applications, where agents might generate content that violates company policies, contains sensitive information, or includes inappropriate material — even if the input was valid.
Consider real-world scenarios where output guardrails are essential. Your travel assistant might generate a reasonable response to a question about nightlife but inadvertently include references to adult entertainment venues. A customer service agent could accidentally expose internal company information while trying to be helpful. Or a content creation agent might produce material that, while technically responding to an appropriate prompt, crosses boundaries that weren’t anticipated during input validation.
Output guardrails complete your security pipeline by ensuring that every response your agents generate undergoes final validation before reaching users. This creates a comprehensive protection system where you control both what goes into your agents and what comes out of them, giving you confidence to deploy sophisticated AI workflows in production environments.
As a reminder from the previous lesson, input guardrails operate before your agent begins processing, validating user requests and blocking inappropriate inputs before any computational resources are consumed. Output guardrails work differently — they execute after your agent has completed its processing and generated a response, but before that response is delivered to the user.
This timing difference is crucial for understanding when and why to use each type of guardrail. Input guardrails are your first line of defense, preventing obviously problematic requests from wasting computational resources or potentially corrupting your agent’s reasoning process. Output guardrails serve as your final quality gate, catching issues that might emerge during the agent’s generation process even when the original input seemed perfectly acceptable.
In multi-agent workflows, output guardrails become even more important because they validate the final output regardless of how many agents were involved in generating it. An agent might receive a clean input, process it appropriately, but still produce output that needs validation due to the complex interactions between different agents or unexpected emergent behaviors in the generation process.
The complementary nature of input and output guardrails means they work together to provide comprehensive protection. Input guardrails prevent bad requests from entering your system, while output guardrails ensure that only appropriate responses leave your system. This dual-layer approach gives you maximum control over your agent’s behavior and helps maintain trust with your users.
In TypeScript, output guardrails are implemented as objects with an `execute` method, rather than as decorated functions. This method is called after the agent generates its output and before the response is delivered to the user.

The `execute` method receives an object containing the agent's output and the current run context. It should return an object indicating whether the output should be blocked (`tripwireTriggered: true`) or allowed (`tripwireTriggered: false`), along with a human-readable message in `outputInfo`.
Here’s the basic structure of an output guardrail:
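The sketch below shows this minimal shape: an object with a `name` and an async `execute` method returning `{ tripwireTriggered, outputInfo }`. The guardrail name and the keyword check are illustrative placeholders, not SDK-mandated logic:

```typescript
// Minimal output guardrail shape: a name plus an async `execute` method
// returning { tripwireTriggered, outputInfo }. The keyword check below is
// a placeholder for real validation logic.
const basicOutputGuardrail = {
  name: 'basic-output-check',
  async execute({ agentOutput }: { agentOutput: unknown }) {
    const text =
      typeof agentOutput === 'string' ? agentOutput : JSON.stringify(agentOutput);
    const blocked = text.toLowerCase().includes('prohibited-term');
    return {
      tripwireTriggered: blocked, // true => the SDK blocks this response
      outputInfo: blocked ? 'Prohibited content detected' : 'Output allowed',
    };
  },
};
```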
To activate an output guardrail, you attach it to your agent using the `outputGuardrails` property when constructing the agent. This ensures that every response generated by the agent is validated before being returned to the user.
Just like with input guardrails, you can use an LLM-based agent to validate outputs. In TypeScript, you define the output schema using zod. This schema describes the structure of the guardrail agent’s output.
Here’s how you define the output schema and set up the guardrail agent:
By using `z.object` to define the output schema, you ensure that the guardrail agent's response is structured and type-safe. The instructions for the guardrail agent are focused on analyzing the agent's generated output for policy violations or inappropriate content.
This approach allows you to reuse the same validation logic for both input and output guardrails, with only minor changes to the agent’s instructions.
To implement an output guardrail in TypeScript, you create an object with an `execute` method. This method is responsible for running the guardrail agent on the agent's output, interpreting the result, and returning a decision about whether to allow or block the output.
Here’s how you can implement an LLM-based output guardrail, following the provided code structure:
The `execute` method receives the agent's output and the current run context. It runs the guardrail agent, passing the output for analysis, and uses the result to decide whether to block or allow the response. If `containsProhibitedContent` is `true`, the guardrail blocks the output; otherwise, it allows it through.
Once you've created your output guardrail object, you attach it to your agent using the `outputGuardrails` property when constructing the agent. This ensures that the guardrail is automatically invoked for every response the agent generates.
Here’s how you attach the output guardrail and handle exceptions:
When an output guardrail determines that a response should be blocked (by setting `tripwireTriggered: true`), the SDK throws an `OutputGuardrailTripwireTriggered` exception. You can catch this exception to handle blocked responses gracefully:
This pattern ensures that any inappropriate or policy-violating output is intercepted and never reaches the end user.
Let’s test your complete output guardrail implementation with both inappropriate and appropriate requests to see how the system behaves. Here’s how you can do this:
When you run this test with the inappropriate request, you’ll see output similar to this:
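The exact text depends on your guardrail agent's verdict and how your exception handler logs it; assuming a handler that prints a short notice when the tripwire fires, the console output might look roughly like:

```text
Response blocked by output guardrail.
```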
For the appropriate hiking request, you’ll see the guardrail agent’s analysis followed by the travel agent’s actual response:
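Again purely illustrative, assuming the guardrail's verdict is logged before the final answer (the model's actual wording will vary):

```text
Guardrail check passed: no prohibited content detected.
Response: Here are some great hiking trails near Denver: ...
```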
This demonstrates how output guardrails work seamlessly with legitimate requests while blocking inappropriate content, ensuring that your agent can provide helpful responses while maintaining safety standards.
You’ve now mastered all four layers of comprehensive agent security in the OpenAI Agents SDK for TypeScript. Your security framework includes secure data handling through private context objects, comprehensive workflow monitoring through event listeners, proactive input validation through input guardrails, and final output validation through output guardrails.
The combination of these security mechanisms gives you the confidence to deploy sophisticated AI workflows in real-world applications where safety, compliance, and reliability are paramount. In the upcoming practice exercises, you’ll apply these output guardrail implementation skills to build more complex validation scenarios and explore advanced patterns for protecting your agent systems.
