Introduction: Taking Control of the LLM Interface

In the previous lesson, we explored why building production-ready AI agents requires engineering discipline, not just better prompts. We identified the 70-80% reliability wall and discovered that the solution lies in applying systematic principles to how we build agent systems.

Now we're ready to get practical. In this lesson, we'll cover the first four factors of the 12-Factor Agents methodology — the foundational principles that establish control over how LLMs interact with your system. These factors form the "input-output layer" of agent reliability, defining what the LLM receives (prompts and context) and what it produces (structured outputs and tool calls). The key shift we're making is moving from hoping the LLM does the right thing to explicitly controlling what it sees and what it can produce.

There's no single "right way" to build agents — the patterns we'll discuss are starting points for experimentation. The goal is to give you the control and flexibility to discover what works for your specific use case. Let's dive in!

Factor 1: Natural Language to Tool Calls

The first factor establishes a fundamental principle: translate natural language into predictable, structured outputs that your system can reliably process.

Here's the LLM's real superpower in agent systems: it can translate natural language intents into structured outputs that your software understands. Instead of generating a narrative response, the LLM can produce a formal structure — typically as JSON or a function call — that your deterministic code can execute.

Let's look at a concrete example. Imagine you're building a payment system agent. A user says:

Without Factor 1, you might let the LLM generate a free-form response like this:

This sounds nice, but it doesn't actually do anything. Your code has to parse this text, figure out what the LLM intended, and hope it mentioned all the necessary details.

With Factor 1, you design the agent to produce structured output instead:

Now your code can immediately parse this JSON and execute the actual payment link creation. The LLM has translated the user's natural language into a precise, machine-executable command. This approach creates a clean separation of concerns: the LLM decides what needs to be done, while your regular code knows how to do it.
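To make that separation concrete, here is a minimal Python sketch of the flow. The sample request, the `parameters` field names, the `handle` function, and the placeholder URL are illustrative assumptions rather than a real payment API; only the `create_payment_link` action name comes from this lesson.

```python
import json

# Hypothetical structured output for a request such as
# "create a payment link for $750 for Terri". All field names are assumptions.
llm_output = """
{
  "action": "create_payment_link",
  "parameters": {"amount": 750, "currency": "USD", "customer": "Terri"}
}
"""


def create_payment_link(amount: int, currency: str, customer: str) -> str:
    """Stand-in for real payment code; it only builds a placeholder URL."""
    return f"https://pay.example.com/{customer.lower()}/{amount}-{currency.lower()}"


def handle(raw: str) -> str:
    """Parse the model's JSON and hand it to deterministic code."""
    command = json.loads(raw)
    if command["action"] == "create_payment_link":
        return create_payment_link(**command["parameters"])
    raise ValueError(f"Unsupported action: {command['action']}")


link = handle(llm_output)
print(link)
```

The model only decides which action to request and with which values; everything that touches money lives in ordinary, testable code.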

Beyond Tool Calls: The Full Spectrum of Structured Outputs

Don't let the factor name "Natural Language to Tool Calls" limit your thinking — this principle applies to any structured output that makes the LLM's response predictable. Different use cases call for different output structures.

Final answers with metadata:

Reasoning traces:

Multi-step plans:

The common thread is predictability. Each output follows a structure that your code can parse, validate, and act upon. You're not parsing narrative text and hoping you extracted the right information — you're working with data structures. Different approaches will work better for different problems — the key is choosing a structure that makes sense for your use case and enforcing it consistently.
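One way to enforce such structures is to model each output type explicitly and parse into it. The sketch below is only an illustration: the `type` discriminator, the class names, and every field are assumptions chosen for this example, not a format the methodology prescribes.

```python
import json
from dataclasses import dataclass


@dataclass
class FinalAnswer:
    answer: str
    confidence: float
    sources: list[str]


@dataclass
class ReasoningTrace:
    steps: list[str]
    conclusion: str


@dataclass
class Plan:
    goal: str
    steps: list[str]


# Map the hypothetical "type" discriminator to the class that models it.
OUTPUT_TYPES = {
    "final_answer": FinalAnswer,
    "reasoning_trace": ReasoningTrace,
    "plan": Plan,
}


def parse_output(raw: str):
    """Turn the model's JSON into one of the known output structures."""
    data = json.loads(raw)
    kind = data.pop("type", None)
    if kind not in OUTPUT_TYPES:
        raise ValueError(f"Unknown output type: {kind!r}")
    # A missing or unexpected field raises TypeError here.
    # Dataclasses do not check value types, so add checks if you need them.
    return OUTPUT_TYPES[kind](**data)
```

Downstream code can then branch on the class it receives instead of searching narrative text for intent.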

Why Structured Outputs Matter

Why is structured output better than free-form text?

Testability: You can verify that, given certain inputs, the agent produces the correct output structure. You can write unit tests that check: "When a user requests a payment link, does the agent output the create_payment_link action with the right fields?"

Reliability: Your code doesn't have to guess what the LLM meant. Either you get a valid structure you can execute, or you get an error you can handle.

Debuggability: When something goes wrong, you can see exactly what the LLM tried to communicate. You're not parsing narrative text to figure out intent.

Composability: These structured outputs become building blocks. Your code can chain actions, validate them before execution, or route them to different handlers based on the type.
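The testability and reliability points can be sketched as a small validator that either returns a usable command or raises a specific error. The `REQUIRED_FIELDS` table and the field names are hypothetical; a real system would likely lean on a schema library instead of hand-written checks.

```python
import json

# Hypothetical mapping from action name to the fields it must carry.
REQUIRED_FIELDS = {"create_payment_link": {"amount", "currency", "customer"}}


def validate_command(raw: str) -> dict:
    """Return the parsed command, or raise ValueError with a specific reason."""
    try:
        command = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc}") from exc
    if not isinstance(command, dict):
        raise ValueError("Output must be a JSON object")
    action = command.get("action")
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown action: {action!r}")
    params = command.get("parameters", {})
    if not isinstance(params, dict):
        raise ValueError("parameters must be a JSON object")
    missing = REQUIRED_FIELDS[action] - set(params)
    if missing:
        raise ValueError(f"Missing fields for {action}: {sorted(missing)}")
    return command
```

A unit test can feed this function known-good and known-bad outputs without calling a model at all, which is exactly what free-form text makes difficult.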

The key insight is that Factor 1 treats the LLM as a translator from natural language requests to high-level commands. This ensures that AI-driven steps integrate cleanly with your traditional software components, rather than existing as a separate, unpredictable layer.

Factor 2: Own Your Prompts

The second factor addresses a critical but often overlooked aspect of agent development: treat prompt design and management as a first-class part of your codebase.

Remember from the previous lesson how we discussed the problem of untracked prompt changes? Someone edits a prompt to fix one issue, and it silently breaks something else. Without version control, you can't track what changed, when, or why. You can't roll back when something breaks. You can't A/B test improvements.

Many modern agent frameworks come with sophisticated, state-of-the-art prompt engineering built in. This is actually great for getting started quickly — the framework's default prompts might be better than what you'd write on day one. They benefit from extensive testing and optimization by experts, and they can help you move fast initially.

But here's the tradeoff: when you let the framework control your prompts, you lose visibility and control over a critical part of your system. In practice, you start with a framework's defaults and things work reasonably well. Then you need to tweak the behavior: maybe the agent is too verbose, or it's not following your company's tone guidelines, or it's making assumptions you want to change. Now you're stuck reading through documentation, trying to configure the framework's abstraction layer, and hoping your change doesn't break something else you can't see.

When something goes wrong, you can't just read the prompt that's actually being used. You have to dig through framework internals or trust that the framework's debugging tools show you everything. This makes problems hard to debug and modifications hard to predict.

Factor 2 says: own your prompts. This means you fully control the prompts given to the LLM — you write them, tune them, version them, and treat them with the same rigor you'd treat any other code. While framework defaults are valuable for getting started fast, production systems need the control and understanding that comes from owning your prompts directly.

What Prompt Ownership Looks Like

Let's understand what "owning" your prompts means in practice:

Write prompts yourself: Whether you start from scratch or copy and adapt from a framework, the prompt text should exist in a form you can directly read and edit. Hand-craft the prompt text for each stage or tool call your agent might use. This gives you complete transparency into what instructions the LLM receives.

Organize and externalize prompts: Keep prompts in separate files (like customer_support_prompt.txt or code_assistant_prompt.md) so they're easy to review and update without altering program flow. Someone should be able to read and understand the prompts without digging through your application code.

Version control your prompts: Store prompts in your repository, just like code. When you change a prompt, that change goes through your normal development workflow — commit, review, test, deploy. This gives you the ability to track what changed, when, and why. You can roll back when something breaks.

Iterate and test: Nobody gets the perfect prompt on the first try. Prompt engineering is creative and iterative. You need the ability to experiment, measure results, and refine over time. Each iteration should be tracked, tested, and reviewed just like any other code change.

Here's how this might look in your codebase:

Your code loads prompts from the prompts/ directory. When you need to change a prompt, you edit the file, commit it, and your tests verify the behavior. This simple structure gives you transparency, version control, and testability.
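A loader for such a directory can be very small. The sketch below assumes prompts are stored as `.txt` files and use `$name` placeholders from Python's standard `string.Template`; the `load_prompt` helper and the `company` variable are hypothetical.

```python
from pathlib import Path
from string import Template

PROMPTS_DIR = Path("prompts")  # assumed location, relative to the project root


def load_prompt(prompt_name: str, prompts_dir: Path = PROMPTS_DIR, **variables: str) -> str:
    """Read <prompts_dir>/<prompt_name>.txt and fill in its $placeholders."""
    text = (prompts_dir / f"{prompt_name}.txt").read_text(encoding="utf-8")
    # substitute() raises KeyError if the file uses a placeholder that was
    # not supplied. A literal dollar sign in the file must be written as $$.
    return Template(text).substitute(**variables)
```

With a file such as `prompts/customer_support_prompt.txt`, a call like `load_prompt("customer_support_prompt", company="Acme")` returns the exact text sent to the model, so a prompt change shows up as an ordinary diff in code review.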

The Iterative Journey of Prompt Engineering

Imagine you're building an AI coding assistant. You start with this prompt:

This works okay, but users complain that the generated code doesn't follow your team's style guidelines. So you iterate:

This improves things, but now you notice the code isn't using modern libraries. You iterate again:

Here's a truth about prompt engineering: it's creative, iterative work. You start with something that seems reasonable, test it against real cases, discover edge cases where it fails, refine the instructions, and repeat. This is similar to how you'd tune any algorithm or heuristic in traditional software.

Each of these changes should be tracked in version control, tested against your test cases, reviewed by team members, and deployed safely. Without this discipline, prompt changes become dangerous. With proper ownership, prompt engineering becomes a systematic process of continuous improvement.
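"Tested against your test cases" can start as simply as a pass-rate check that runs on every prompt change. In this sketch the cases, the checks, and the `fake_model` stub are all assumptions; in practice `model` would wrap a real LLM call and the checks would encode your own style rules.

```python
from typing import Callable

# Each case pairs a user input with a check on the model's output.
# Both the inputs and the checks are illustrative assumptions.
CASES = [
    ("Write a function that adds two numbers", lambda out: "def " in out),
    ("Write a function that reverses a string", lambda out: "def " in out),
]


def pass_rate(prompt: str, model: Callable[[str, str], str], cases=CASES) -> float:
    """Run every case through the model and return the fraction that passed."""
    passed = sum(1 for user_input, check in cases if check(model(prompt, user_input)))
    return passed / len(cases)


def fake_model(prompt: str, user_input: str) -> str:
    """Stub standing in for a real LLM call so the sketch runs offline."""
    return "def solution():\n    pass"
```

Comparing the pass rate of a candidate prompt with the current one gives reviewers a number to discuss instead of an impression.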

Owning your prompts means understanding that prompts are executable logic. They determine your agent's behavior just as much as your Python or JavaScript code does. Would you edit production code without version control and review? Of course not. Apply the same discipline to prompts. This ownership gives you the control and understanding needed for production systems, the flexibility to experiment and discover what works, and the safety to deploy changes confidently.

Factor 3: Own Your Context Window

The third factor addresses one of the most critical resources in LLM systems: actively design and optimize every token that goes into the context window to maximize value.

As we discussed in the previous lesson, an LLM's context window is its limited memory of recent text. It includes your prompt, conversation history, retrieved data, and the current user message. This context determines what the AI "knows" when generating its response.

The problem is that context windows have hard limits. Depending on the version, GPT-4 supports roughly 8,000 to 128,000 tokens, but you can't keep adding information indefinitely. Even before you hit the limit, you pay for every token processed, and response times increase with context size.

Most developers think of this as "prompt engineering," but Factor 3 treats it as something broader: context engineering, the active design of everything that goes into the context window. It covers not just the instructions you give the model, but also which data, conversation history, documents, and metadata should be present, how to format that information, when to include each element, and how to compress or summarize when needed.

Factor 3 says: own your context window. Don't let the framework or model automatically accumulate context. Instead, practice context engineering — explicitly controlling what the AI sees and remembers, optimizing every token to get maximum value.

The Context Accumulation Problem

Let's look at what happens without Factor 3. Imagine a customer support conversation that continues for 13 turns. By turn 13, your context might look like this:

Context Component | Token Count
System prompt | 500
Previous 12 messages | 3,000
Retrieved return policy docs | 1,500
Current user message | 50
Available tools list | 800
Total | 5,850

You're now at 5,850 tokens before the model even responds. On a model with an 8,000-token window, a few more turns will hit the limit. The framework might start dropping old messages to make room, but then the bot "forgets" important information from earlier in the conversation.

Worse, you're paying for all those tokens on every call. If this conversation continues for 20 turns, the calls together process tens of thousands of tokens, most of which aren't relevant to the current question. This is inefficient and expensive, and including irrelevant information can actually hurt accuracy.
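The arithmetic is easy to check. This sketch reuses the token counts from the table and adds two assumptions of its own: each earlier message averages 250 tokens (3,000 divided by 12), and the system prompt, tools list, and retrieved documents are resent on every call.

```python
# Token counts from the table above.
SYSTEM_PROMPT = 500
TOOLS_LIST = 800
RETRIEVED_DOCS = 1_500
CURRENT_MESSAGE = 50
HISTORY_PER_TURN = 3_000 // 12  # 250 tokens per earlier message, an assumed average


def tokens_for_call(turn: int) -> int:
    """Input tokens sent on a given turn when nothing is ever pruned."""
    history = HISTORY_PER_TURN * (turn - 1)
    return SYSTEM_PROMPT + TOOLS_LIST + RETRIEVED_DOCS + CURRENT_MESSAGE + history


def cumulative_tokens(turns: int) -> int:
    """Total input tokens processed across all calls up to `turns`."""
    return sum(tokens_for_call(t) for t in range(1, turns + 1))


print(tokens_for_call(13))    # 5850, matching the table total
print(cumulative_tokens(20))  # 104500 under these assumptions
```

Because the history is resent each time, the cumulative cost grows roughly with the square of the number of turns, which is why pruning and summarizing pay off quickly.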

Context Engineering Practices

With Factor 3, you actively engineer what goes into the context window. Here are the key practices:

Select relevant context: For each step, choose what's actually important. If the user is asking about returns, you don't need the entire conversation history — just the current question and maybe the last exchange for continuity.

Inject necessary data: When the user asks about "my order," retrieve that specific order's details and inject them into the context. Don't include all orders or all customer data.

Summarize or compress: For long conversations, create summaries of earlier exchanges. Instead of including 10 previous messages verbatim, include a summary: "User asked about return policy, bot explained 30-day window and condition requirements."

Structure creatively: Format context in ways that help the model reason effectively. Use clear sections, bullet points, or structured data rather than prose.

Experiment beyond standard patterns: You don't necessarily need to follow the standard approach of alternating user and assistant messages. You could inject database results as structured data, provide dynamic system instructions that change based on state, or format information as tables, bullet lists, or JSON — whatever helps the model reason best for your specific use case.

Context engineering is an active, ongoing process. For each agent step, you make decisions about what information is relevant, what external data needs to be retrieved, what can be summarized, and how to structure everything for optimal reasoning. There's no universal "best practice" — it depends on your use case, your model, and what you're trying to optimize for.
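As one small instance of selecting relevant context, the sketch below keeps only the most recent messages that fit a token budget. The whitespace-based `estimate_tokens` is a deliberately crude assumption; a real system would use the tokenizer that matches its model.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def select_recent(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one
    # that would push the running total over the budget.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))
```

Recency is only one possible selection rule. The same shape works for relevance scores or for replacing the dropped messages with a summary.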

Practical Context Engineering

Let's see a concrete example. A user is on turn 5 of a conversation about returning an order. Without Factor 3, your context might include all previous messages verbatim:

That's a lot of tokens for a simple request.

With Factor 3, you actively curate the context:

You've compressed the conversation history into a brief summary and injected only the relevant order details. The model has everything it needs to respond appropriately, but you've reduced token count significantly. This isn't following the standard user/assistant message pattern — you've structured the context in a way that optimizes for this specific interaction.
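A curated context of this kind can be assembled with plain string formatting. The section labels, the `build_context` helper, and the order fields below are assumptions made for illustration; the summary text echoes the earlier return-policy example.

```python
def build_context(summary: str, order: dict, question: str) -> str:
    """Assemble a compact context: summary, one order's details, current question."""
    order_lines = "\n".join(f"- {key}: {value}" for key, value in order.items())
    return (
        "Conversation summary:\n"
        f"{summary}\n\n"
        "Relevant order:\n"
        f"{order_lines}\n\n"
        "Current question:\n"
        f"{question}"
    )


context = build_context(
    summary="User asked about return policy; bot explained the 30-day window and condition requirements.",
    order={"order_id": "A-1001", "item": "blue shirt", "status": "delivered"},  # hypothetical order
    question="How do I start the return?",
)
print(context)
```

Nothing here depends on the chat-message format, which is the point: you decide the layout the model sees.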

This explicit management delivers multiple benefits: improved accuracy (the model focuses on relevant information), reduced costs (fewer tokens processed), no context overflow (you stay under the token limit by design), and better performance (faster responses). Most importantly, it gives you room to experiment: you can try different approaches and discover what works best for your use case.

Factor 3 recognizes that the context window is a precious, limited resource. By owning it — actively engineering what the model sees — you gain control, improve reliability, and create a powerful space for optimization.

Factor 4: Tools Are Just Structured Outputs

The fourth factor demystifies one of the most talked-about features of modern LLM agents: tool calling is simply an application of Factor 1 — structured outputs with schemas.

When people talk about "tool calling" or "function calling," it can sound like magic — the LLM somehow directly executes code or calls APIs. But Factor 4 reveals what's really happening: the LLM produces a structured output (typically JSON) that represents a tool invocation, and your code handles the actual execution.

This is actually just an extension of Factor 1. Remember, Factor 1 says the LLM should produce structured outputs rather than free-form text. Tool calls are simply a specific type of structured output — one that follows a predefined schema describing available tools. The LLM acts as a producer, generating a data structure, and your software acts as a consumer, executing the requested action.

Anatomy of a Tool Call

Suppose you're building an agent that can check the weather. You define a tool with a schema:

When a user asks "What's the weather in Paris tomorrow?", the LLM doesn't magically call a weather API. Instead, it produces a structured output:

Your code receives this JSON, validates it against the schema, and then executes the actual API call. This is the producer-consumer pattern: the LLM produces a data structure describing what needs to happen, and your software consumes it to perform the action. The LLM never directly touches the weather API; it just outputs a request in the correct format. Of course, tool calls can still fail for reasons like invalid parameters or API errors, so proper error handling is essential, a topic we'll cover later.

This separation makes the system much easier to reason about, test, and debug. When something goes wrong, you can pinpoint whether the issue is in the LLM's tool selection (did it choose the right tool? did it provide valid parameters?) or in your execution code (did the API call fail? was there a network error?).
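Here is a minimal sketch of that producer-consumer split for the weather example. The tool definition follows the JSON Schema style many tool-calling APIs accept, but the tool name, the `tool` and `arguments` keys, and the `fetch_forecast` placeholder are assumptions. The validation is hand-written to keep the sketch dependency-free; a schema library could replace it.

```python
import json

# Hypothetical tool definition in JSON Schema style.
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Get the weather forecast for a city on a given day.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["location", "date"],
    },
}

# What the model might emit for "What's the weather in Paris tomorrow?"
llm_output = '{"tool": "get_weather", "arguments": {"location": "Paris", "date": "tomorrow"}}'


def fetch_forecast(location: str, date: str) -> str:
    """Placeholder for the real weather API call."""
    return f"Forecast for {location} ({date}): not fetched in this sketch"


def run_tool_call(raw: str) -> str:
    """Consumer side: validate the model's request, then execute it."""
    call = json.loads(raw)
    if call.get("tool") != GET_WEATHER_TOOL["name"]:
        raise ValueError(f"Unknown tool: {call.get('tool')!r}")
    args = call.get("arguments", {})
    schema = GET_WEATHER_TOOL["parameters"]
    missing = [name for name in schema["required"] if name not in args]
    if missing:
        raise ValueError(f"Missing arguments: {missing}")
    for name, value in args.items():
        # Every property in this schema is a string, so one check covers them.
        if name not in schema["properties"] or not isinstance(value, str):
            raise ValueError(f"Invalid argument: {name!r}")
    return fetch_forecast(**args)
```

A `ValueError` from `run_tool_call` points at the model's output, while a failure inside `fetch_forecast` points at your execution code, which is the boundary described above.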

Why Understanding This Matters

Why is this perspective important?

Clarity and testability: You can test that the LLM produces valid tool calls for various inputs. You can test that your execution code handles those tool calls correctly. These are separate concerns with clear boundaries.

Simplicity of integration: Adding a new tool means defining its schema and writing the execution code. You don't need to teach the LLM how to call APIs — you just describe what the tool does and what parameters it needs.

Safety via sandboxing: The LLM can only request predefined actions. It can't execute arbitrary code or make unexpected API calls. Your execution code acts as a gatekeeper, validating requests before performing any actual operations.

Flexibility in implementation: Different frameworks implement tool calling differently — some use special API parameters, others use specific prompt formats, some use fine-tuned models. But beneath the surface, they're all doing the same thing: getting the LLM to produce structured outputs that match a schema. When you own your prompts (Factor 2) and your context (Factor 3), you can implement tool calling in whatever way makes sense for your system.
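The integration and gatekeeping points can be sketched as a registry that doubles as an allowlist. The `tool` decorator, the `TOOLS` dictionary, and both placeholder handlers are hypothetical names for this example.

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}  # allowlist of tools the agent may request


def tool(name: str):
    """Decorator that registers a handler under a tool name."""
    def register(handler: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = handler
        return handler
    return register


@tool("get_weather")
def get_weather(location: str, date: str) -> str:
    return f"weather for {location} on {date}"  # placeholder result


@tool("lookup_order")
def lookup_order(order_id: str) -> str:
    return f"order {order_id}"  # placeholder result


def execute(call: dict) -> str:
    """Run a requested tool only if it is on the allowlist."""
    handler = TOOLS.get(call.get("tool"))
    if handler is None:
        raise PermissionError(f"Tool not allowed: {call.get('tool')!r}")
    return handler(**call.get("arguments", {}))
```

Adding a tool is one decorated function, and anything the model requests outside the registry is rejected before any code runs.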

The connection to Factor 1 should now be clear: tool calls ARE structured outputs. Factor 1 establishes the principle of structured outputs over free-form text. Factor 4 applies that principle specifically to tool integration, showing how schemas and validation make tool calling reliable and safe.

How These Four Factors Work Together

Now that we understand how each factor works individually, from structured outputs to tool schemas, let's see how they work together in practice. Imagine you're building a customer support agent, and a customer sends this message:

Picture how the four factors orchestrate a reliable response:

First, Factor 2 (Own Your Prompts) sets the stage. Your carefully versioned prompt instructs the agent: "When handling returns, first look up the order, then verify eligibility, then initiate the return process." The agent isn't improvising — it's following your explicit guidelines.

Next, Factor 3 (Own Your Context Window) provides exactly what's needed. Instead of dumping every order from the past year into the context, your system retrieves just the customer's recent orders from the last 30 days. You structure this as clean, optimized data rather than following a standard message pattern. The agent sees the blue shirt order from November 5th, the tracking number, the delivery date — everything relevant, nothing extraneous.

Then Factor 1 (Natural Language to Tool Calls) kicks in. The agent doesn't respond with "Let me help you with that return!" Instead, it produces this structured output:

Finally, Factor 4 (Tools Are Just Structured Outputs) ensures precision. Your execution code validates this against your lookup_order schema, confirms all required fields are present, executes the database query, and returns the order details. The agent then produces another structured output:

Each step is testable, debuggable, and predictable. If something goes wrong, you know exactly where: Did the agent choose the wrong tool? Did it provide invalid parameters? Did your execution code fail? The four factors create clear boundaries and explicit control points throughout the interaction.
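The whole exchange can be sketched as a short loop over the model's structured outputs. The model is stubbed with two canned outputs so the example runs offline. The order ID, the `initiate_return` tool and its fields, and the in-memory `ORDERS` store are assumptions; only `lookup_order` and the blue shirt order come from the walkthrough.

```python
import json

# Hypothetical order store; a real system would query a database.
ORDERS = {"A-1001": {"item": "blue shirt", "order_date": "November 5", "delivered": True}}


def lookup_order(order_id: str) -> dict:
    return ORDERS[order_id]


def initiate_return(order_id: str, reason: str) -> dict:
    return {"order_id": order_id, "status": "return_started", "reason": reason}


HANDLERS = {"lookup_order": lookup_order, "initiate_return": initiate_return}

# Two structured outputs the model might produce in sequence (stubbed here).
model_outputs = [
    '{"tool": "lookup_order", "arguments": {"order_id": "A-1001"}}',
    '{"tool": "initiate_return", "arguments": {"order_id": "A-1001", "reason": "wrong size"}}',
]

results = []
for raw in model_outputs:
    call = json.loads(raw)            # Factor 1: structured output, not prose
    handler = HANDLERS[call["tool"]]  # Factor 4: an unknown tool name raises KeyError
    results.append(handler(**call["arguments"]))

print(results[-1]["status"])  # return_started
```

Each line of the loop is a place to add logging, validation, or a test, which is what makes failures easy to localize.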

Summary: Your Foundation for Reliable Agents

We've covered the first four factors of the 12-Factor Agents methodology, and these factors form the foundation for everything else we'll build.

Factor 1 (Natural Language to Tool Calls) establishes that the LLM's output should be structured, executable commands — whether that's tool calls, final answers, or reasoning traces.

Factor 2 (Own Your Prompts) treats prompts as first-class code artifacts. While frameworks offer sophisticated defaults that help you start fast, owning your prompts gives you the control and understanding needed for production systems.

Factor 3 (Own Your Context Window) makes context engineering explicit. This is bigger than prompt engineering — it's about optimizing every token, experimenting with different approaches, and not being constrained by standard patterns.

Factor 4 (Tools Are Just Structured Outputs) demystifies tool calling as a producer-consumer pattern, simply applying Factor 1 with schemas.

Together, these four factors establish disciplined control over the LLM interface — you're no longer at the mercy of unpredictable AI behavior. There's no single "right way" to implement these factors. The patterns we've discussed are starting points for experimentation. Your job is to discover what works for your specific use case.

In the upcoming practice exercises, you'll work through scenarios to verify your understanding of these patterns. Looking ahead, the next unit will build on this foundation with Factors 5-8, which cover state management and control flow. But first, let's test what we've learned here!
