Introduction & Context

In the previous lessons, you built a solid foundation for securing OpenAI agent workflows in JavaScript. You learned how to securely handle sensitive data using private context objects, monitor agent execution with event listeners, and protect against harmful inputs using input guardrails. Now, you’re ready to implement the final critical layer of your security framework: output guardrails.

While input guardrails protect your agents from problematic user requests, output guardrails serve as your last line of defense by validating what your agents actually generate before those responses reach end users. This is especially important in production applications, where agents might generate content that violates company policies, exposes sensitive information, or includes inappropriate material — even if the input was valid.

Consider real-world scenarios where output guardrails are essential. Your travel assistant might generate a reasonable response to a question about nightlife but inadvertently include references to adult entertainment venues. A customer service agent could accidentally expose internal company information while trying to be helpful. Or a content creation agent might produce material that, while technically responding to an appropriate prompt, crosses boundaries that weren’t anticipated during input validation.

Output guardrails complete your security pipeline by ensuring that every response your agents generate undergoes final validation before reaching users. This creates a comprehensive protection system where you control both what goes into your agents and what comes out of them, giving you confidence to deploy sophisticated AI workflows in production environments.

Understanding Output Guardrails vs Input Guardrails

As a reminder from the previous lesson, input guardrails operate before your agent begins processing, validating user requests and blocking inappropriate inputs before any computational resources are consumed. Output guardrails work at the opposite end of the pipeline: after your agent has produced a response but before that response is returned to the user.

Guardrail type   | When it runs                              | What it protects
Input guardrail  | Before agent execution                    | Prevents unsafe requests from being handled
Output guardrail | After agent execution, before user output | Prevents unsafe responses from reaching users

Together, input and output guardrails form a safety sandwich that protects both ends of your workflow.

How Output Guardrails Work

Output guardrails follow the same structural pattern as input guardrails. Each guardrail exposes an execute function that receives two parameters:

  • agentOutput: the response generated by your agent (string, structured data, etc.)
  • context: the shared run context object

The guardrail returns an object describing the validation decision:

  • outputInfo: additional information about the guardrail decision (e.g., redacted text, warnings)
  • tripwireTriggered: whether the output should be blocked (true) or allowed (false)

If an output guardrail returns tripwireTriggered: true, the SDK throws an OutputGuardrailTripwireTriggered exception. Your application can catch this exception, inspect the guardrail result, and decide whether to return a redacted response, ask the agent to retry, or show a warning.

Here's a minimal output guardrail structure:
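The sketch below uses an illustrative guardrail (the name and the hostname check are made up for the example), following the execute(agentOutput, context) shape described above; exact property names may vary by SDK version:

```javascript
// Illustrative output guardrail: trips when the response mentions an
// internal hostname. Returns the { outputInfo, tripwireTriggered } shape
// described above.
const internalInfoGuardrail = {
  name: 'internal-info-check',
  async execute(agentOutput, context) {
    // Agent output may be a string or structured data, so normalize first.
    const text =
      typeof agentOutput === 'string' ? agentOutput : JSON.stringify(agentOutput);
    const leaked = /internal\.example\.com/i.test(text);
    return {
      outputInfo: leaked ? { reason: 'mentions an internal hostname' } : {},
      tripwireTriggered: leaked,
    };
  },
};
```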

Attaching Output Guardrails to Agents

Just like input guardrails, you attach output guardrails when constructing your agent:
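For example (a sketch: `Agent` is a minimal local stand-in so the snippet is self-contained, and the guardrail name and check are illustrative; in a real project the agent class comes from the SDK):

```javascript
// Minimal stand-in for the SDK's Agent class, just enough to show where
// outputGuardrails is attached at construction time.
class Agent {
  constructor({ name, instructions, outputGuardrails = [] }) {
    this.name = name;
    this.instructions = instructions;
    this.outputGuardrails = outputGuardrails;
  }
}

// Illustrative guardrail for a family-friendly travel assistant.
const familyFriendlyGuardrail = {
  name: 'family-friendly-check',
  async execute(agentOutput, context) {
    const flagged = /casino|nightclub/i.test(String(agentOutput));
    return { outputInfo: { flagged }, tripwireTriggered: flagged };
  },
};

const travelAgent = new Agent({
  name: 'Travel Assistant',
  instructions: 'Help users plan family-friendly trips.',
  outputGuardrails: [familyFriendlyGuardrail],
});
```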

When your agent produces a response, each guardrail in outputGuardrails runs in order. All guardrails must return tripwireTriggered: false for the output to be delivered to the user.
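The ordering semantics can be sketched as a simple loop (this mimics the behavior described above, not the SDK's actual internals):

```javascript
// Guardrails run sequentially; the first tripwire blocks delivery and
// later guardrails are not consulted.
async function applyOutputGuardrails(guardrails, agentOutput, context = {}) {
  for (const guardrail of guardrails) {
    const result = await guardrail.execute(agentOutput, context);
    if (result.tripwireTriggered) {
      return { delivered: false, blockedBy: guardrail.name, outputInfo: result.outputInfo };
    }
  }
  // Every guardrail returned tripwireTriggered: false.
  return { delivered: true, output: agentOutput };
}
```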

Handling Blocked Outputs

When an output guardrail raises a tripwire, the SDK throws an OutputGuardrailTripwireTriggered exception. Catching this exception lets you log the issue, show a safe fallback to the user, or retry with adjusted instructions.
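A sketch of the catch-and-fallback pattern (the exception class is defined locally so the example runs standalone; in a real app it would be imported from the SDK, and running the agent would be what throws it):

```javascript
// Local stand-in for the SDK's exception, carrying the guardrail result.
class OutputGuardrailTripwireTriggered extends Error {
  constructor(result) {
    super(`Output guardrail tripped: ${result.guardrailName}`);
    this.result = result;
  }
}

// Wraps an agent run so blocked outputs become a safe fallback message.
async function respondSafely(runAgent, userInput) {
  try {
    return await runAgent(userInput);
  } catch (err) {
    if (err instanceof OutputGuardrailTripwireTriggered) {
      console.warn('Blocked output from guardrail:', err.result.guardrailName);
      return "Sorry, I can't share that response. Could you rephrase your request?";
    }
    throw err; // unrelated errors still propagate
  }
}
```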

Creating LLM-Based Output Guardrails

Similar to LLM-powered input guardrails, you can build output guardrails that use another agent to review responses. This “guardrail agent” examines the agent output, decides whether it violates policy, and returns structured guidance.
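As a sketch, `reviewWithGuardrailAgent` below stands in for the call to that guardrail agent; here it is a keyword stub so the example is self-contained, but in production it would prompt a reviewer model and parse its structured verdict:

```javascript
// Stand-in for a guardrail agent call. In production this would ask a
// reviewer agent something like: "Does this response violate our content
// policy? Reply with { violatesPolicy: boolean, reason: string }."
async function reviewWithGuardrailAgent(text) {
  const violatesPolicy = /casino|adult entertainment/i.test(text);
  return { violatesPolicy, reason: violatesPolicy ? 'adult-oriented content' : null };
}

// The guardrail delegates its decision to the reviewer's verdict.
const llmPolicyGuardrail = {
  name: 'llm-policy-review',
  async execute(agentOutput, context) {
    const verdict = await reviewWithGuardrailAgent(String(agentOutput));
    return {
      outputInfo: { reason: verdict.reason },
      tripwireTriggered: verdict.violatesPolicy,
    };
  },
};
```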

Observing Guardrail Decisions

When an output guardrail blocks a response, the thrown exception includes the guardrail result. You can access redacted content or remediation details from err.result.output:
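A standalone sketch (the exception class and the `redactedText` field are defined locally for illustration; the err.result.output path follows the description above):

```javascript
// Local stand-in for the SDK's exception; err.result carries the
// guardrail's outputInfo under the `output` key.
class OutputGuardrailTripwireTriggered extends Error {
  constructor(result) {
    super('Output guardrail tripwire triggered');
    this.result = result;
  }
}

// Delivers the output, or throws with the guardrail's details attached.
function deliver(agentOutput, guardrailResult) {
  if (guardrailResult.tripwireTriggered) {
    throw new OutputGuardrailTripwireTriggered({ output: guardrailResult.outputInfo });
  }
  return agentOutput;
}

try {
  deliver('raw response', {
    tripwireTriggered: true,
    outputInfo: { redactedText: 'Here is a safe summary instead.' },
  });
} catch (err) {
  if (err instanceof OutputGuardrailTripwireTriggered) {
    // Remediation details the guardrail attached, e.g. a redacted version:
    console.log(err.result.output.redactedText);
  }
}
```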

Summary & Next Steps

You now know how to secure both sides of your agent workflows. Input guardrails keep unsafe requests out, while output guardrails prevent unsafe responses from getting through. Together with secure context handling and monitoring, you’ve built a complete safety shield around your agents.

In the following exercises, you'll practice converting existing guardrails to operate on outputs, create powerful LLM-based output filters, and attach both input and output guardrails to a single agent for comprehensive protection.
