Welcome to the final lesson of Codex Subagents & Multi-Agent Orchestration! Over the past three lessons, we have established the foundations of reliable subagent automation: strict contracts for predictable output, parallel orchestration for concurrent work, and gated pipelines for safe code modifications. Each pattern addressed specific challenges in building production-ready automation.
This capstone lesson brings these concepts together into a comprehensive, production-grade orchestration script. We will build a multi-agent quality audit system that runs parallel analyses across multiple package directories, implements robust retry logic, enforces failure thresholds, and generates both human-readable reports and machine-parsable summaries. By the end, you will understand how to combine reliability patterns with observability practices to create maintainable, debuggable automation workflows.
When operating multiple subagents in production, two concerns become paramount: reliability and observability. Reliability means the system degrades gracefully when individual agents fail, rather than cascading into total failure. Observability means we can understand what happened during execution, debug failures, and audit agent behavior after the fact.
Achieving reliability requires several strategies: timing operations to detect performance issues, retrying failed agents to handle transient errors, and enforcing thresholds so that widespread failures halt the pipeline rather than producing unreliable output. Observability demands comprehensive logging, structured artifacts for both humans and machines, and clear failure diagnostics.
These principles shape every design decision in our capstone script. We will instrument agent execution with timing data, persist detailed logs for each agent, aggregate results into multiple report formats, and validate overall quality before declaring success.
Our capstone orchestrator follows a specific workflow:
- Discover work scopes by listing package directories in a monorepo structure.
- Execute agents (one per scope), capturing timing, output, and errors.
- Retry once when an agent fails to produce valid output (to handle transient failures).
- Aggregate results into:
  - a Markdown report for human review
  - a JSON summary for programmatic processing
- Enforce quality gates by calculating the failure rate across all agents and halting when it exceeds a threshold.
This architecture balances concurrency and speed (easy to parallelize) with observability (logs + artifacts) and reliability (retries + thresholds).
We begin by identifying which directories contain code that needs auditing. Sort the scopes for deterministic runs (important for CI reproducibility and stable artifact diffs):
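A minimal sketch of this discovery step (the function name discover_scopes is illustrative, not the script's actual identifier):

```python
from pathlib import Path

def discover_scopes(root: str = "packages") -> list[str]:
    """Return sorted package directories under root, each with a trailing slash."""
    base = Path(root)
    if not base.is_dir():
        return []
    # Sorting keeps runs deterministic for CI reproducibility and stable diffs.
    return sorted(f"{d.as_posix()}/" for d in base.iterdir() if d.is_dir())
```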
This function scans the packages/ directory and returns a sorted list of subdirectory paths, each with a trailing slash. The trailing slash convention creates unambiguous scope boundaries for agents: packages/common/ means the agent should analyze only files under that path.
The core execution function combines subprocess invocation with timing instrumentation:
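A hedged sketch of what this wrapper might look like. The run_agent name and timeout_s parameter are illustrative; only the flags discussed in this lesson (--json, -a never, -s read-only) are assumed:

```python
import subprocess, time

def run_agent(scope: str, prompt: str, timeout_s: int = 300) -> dict:
    """Run one subagent non-interactively, capturing timing, output, and errors."""
    cmd = ["codex", "exec", "--json", "-a", "never", "-s", "read-only", prompt]
    start = time.monotonic()  # capture start time before launching the subagent
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        stdout, stderr, exit_code = proc.stdout, proc.stderr, proc.returncode
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        # Treat a missing binary or a hang as a failed agent, not a crash.
        stdout, stderr, exit_code = "", str(exc), -1
    runtime_ms = int((time.monotonic() - start) * 1000)  # elapsed milliseconds
    return {"scope": scope, "exit_code": exit_code,
            "runtime_ms": runtime_ms, "stdout": stdout, "stderr": stderr}
```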
We capture the start time before launching the subagent, then measure the elapsed time in milliseconds after completion. This metric reveals performance outliers: if one agent takes significantly longer than others, it might indicate code complexity or an agent struggling with its task.
We use -a never for non-interactive execution and -s read-only for a read-only sandbox. As in earlier units, -a never is an approval policy, not what makes the run read-only; the read-only guarantee here comes from the sandbox mode.
After execution completes, we persist comprehensive logs:
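A sketch of the log writer, using the scope=, exit=, --- stdout ---, and --- stderr --- markers referenced later in the debugging checklist (the function name is illustrative):

```python
from pathlib import Path

def write_agent_log(path: Path, scope: str, exit_code: int,
                    stdout: str, stderr: str) -> None:
    """Persist one agent's full transcript for later debugging and audits."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"scope={scope}\nexit={exit_code}\n"
        f"--- stdout ---\n{stdout}\n--- stderr ---\n{stderr}\n"
    )
```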
Each log file captures the agent’s scope, exit code, standard output, and standard error. These logs serve multiple purposes: they help debug failures, provide audit trails for what each agent observed, and preserve error messages when contract extraction fails.
Having persistent logs means that even if the orchestrator crashes, you can inspect partial results and understand which agents completed successfully.
As in the previous lessons, exec --json returns a JSONL event stream on stdout, so we reuse the same helper to extract the final contract object from the completed agent_message event. We also keep the same defensive pattern: if extraction fails, return a well-formed failure result instead of crashing the whole orchestration.
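As an illustration only, here is one defensive way such a helper could look. The JSONL event shape assumed here (a type field of agent_message carrying a message string) is a simplification of the real event stream, which is exactly why the fallback path matters:

```python
import json

FAILURE_DEFAULT = {"status": "failed", "summary": "", "findings": [],
                   "files_read": [], "files_modified": []}

def extract_contract(stdout: str, scope: str) -> dict:
    """Find the final agent message in the JSONL stream and parse the contract
    object out of it; return a well-formed failure record on any problem."""
    message = None
    for line in stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the stream
        if isinstance(event, dict) and event.get("type") == "agent_message":
            message = event.get("message")  # keep the last one seen
    if isinstance(message, str):
        try:
            contract = json.loads(message)
            if isinstance(contract, dict):
                return contract
        except json.JSONDecodeError:
            pass  # malformed payload falls through to the failure record
    return {**FAILURE_DEFAULT, "summary": f"No valid contract for {scope}"}
```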
Production note: Another Codex option you may encounter is --output-schema. This flag lets you define the expected shape of the agent’s final response, which can make downstream parsing more reliable. It solves a different problem than --json: --json makes the CLI output machine-readable as structured JSONL events, while --output-schema constrains the final model payload itself. In production, you may use both together: --json for structured event transport and --output-schema for stronger response-shape guarantees. We still keep defensive parsing in this capstone because a reliable orchestrator should handle malformed, missing, or incomplete outputs gracefully even when stronger schema controls are available. If you choose to use it, --output-schema would be added to the codex exec invocation alongside --json, before the prompt argument.
In this capstone, the new emphasis is on making every result carry orchestration metadata (runtime_ms, exit_code, and scope) so retries, reporting, and failure-threshold checks can work from a uniform record shape.
The audit_scope function combines task definition with retry logic. First, we define a strict contract that matches earlier lessons, including the consistent shape for files_modified (["string"], not a literal []). For read-only audits, the value must be []:
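One possible phrasing of that contract prompt (the exact wording is an assumption; the field set matches the result records described in this lesson):

```python
# files_modified uses the same ["string"] shape as the other list fields,
# and must come back as an empty list for a read-only audit.
AUDIT_CONTRACT = """You are a code quality auditor. Analyze only files under {scope}.
Respond with ONLY one JSON object, no prose, in exactly this shape:
{{
  "status": "ok" | "failed",
  "summary": "string",
  "findings": ["string"],
  "files_read": ["string"],
  "files_modified": ["string"]
}}
This audit is read-only: "files_modified" must be [].
"""

prompt = AUDIT_CONTRACT.format(scope="packages/common/")
```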
We generate a deterministic log filename by transforming the scope path into a safe filename format. That naming convention becomes important later when you correlate summary.json entries with individual log files.
After the initial execution, we implement simple retry logic:
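A minimal sketch of this retry loop. Here run_once stands in for the execute-and-extract step, and the log-name helper mirrors the agent_packages_foo.log naming convention used throughout this lesson:

```python
RETRY_ON_FAIL = 1  # module-level retry budget

def scope_to_logname(scope: str, attempt: int = 0) -> str:
    """packages/foo/ -> agent_packages_foo.log (or agent_packages_foo_retry1.log)."""
    base = "agent_" + scope.strip("/").replace("/", "_")
    return f"{base}.log" if attempt == 0 else f"{base}_retry{attempt}.log"

def audit_with_retry(run_once, scope: str) -> dict:
    """run_once(scope, log_name) must return a result dict with a 'status' key."""
    result = run_once(scope, scope_to_logname(scope))
    attempt = 0
    while result.get("status") == "failed" and attempt < RETRY_ON_FAIL:
        attempt += 1
        # A numbered suffix preserves the original attempt's log as evidence.
        result = run_once(scope, scope_to_logname(scope, attempt))
    return result
```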
If the first attempt fails, we retry up to RETRY_ON_FAIL times (set to 1 at the module level). The retry uses a different log filename so that we preserve both attempts, and numbered retry suffixes (_retry1.log, _retry2.log, ...) scale cleanly if you later increase the retry count. This is the key capstone reliability pattern: tolerate transient failures without losing evidence from the original run.
After collecting results from all agents, we synthesize them into a Markdown report:
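A sketch of the report builder (heading text and layout are assumptions):

```python
def build_report(results: list[dict]) -> str:
    """Render successful findings as Markdown, grouped by scope."""
    failed = [r for r in results if r.get("status") == "failed"]
    lines = ["# Quality Audit Report", "",
             f"- Agents: {len(results)}",      # total agent count
             f"- Failed: {len(failed)}", "",   # failure count
             "## Findings", ""]
    for r in results:
        if r.get("status") == "failed":
            continue  # failures get their own section
        lines += [f"### {r['scope']}", "", r.get("summary", ""), ""]
        lines += [f"- {finding}" for finding in r.get("findings", [])]
        lines.append("")
    return "\n".join(lines)
```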
We start by calculating summary statistics: the total agent count and the failure count. The main findings section iterates through successful results, creating a subsection for each scope. Each subsection includes the agent’s summary and a bulleted list of findings.
The report concludes with a failures section for quick triage:
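That triage section could be generated like this (layout is an assumption):

```python
def failures_section(results: list[dict]) -> str:
    """Render one line per failed scope with its exit code and summary."""
    failed = [r for r in results if r.get("status") == "failed"]
    lines = ["## Failures", ""]
    if not failed:
        lines.append("None.")
    for r in failed:
        lines.append(f"- {r['scope']} (exit={r.get('exit_code', '?')}): "
                     f"{r.get('summary', '')}")
    return "\n".join(lines)
```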
By separating successful findings from failures, you make it easy to prioritize: fix the failures first to get complete coverage, then address the quality issues identified by successful agents.
Alongside the human-readable Markdown, we generate structured JSON for programmatic consumption:
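A sketch of the summary writer:

```python
import json
from pathlib import Path

def write_summary(results: list[dict], out_dir: Path) -> Path:
    """Write summary.json containing the complete, uniform result records."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "summary.json"
    path.write_text(json.dumps({"results": results}, indent=2))
    return path
```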
This JSON file contains the complete results array, preserving all fields from each agent: status, summary, findings, files_read, files_modified, timing metrics, exit_code, and scope. Downstream tools can parse this file to generate dashboards, track quality trends over time, or integrate with CI/CD systems.
When a run is incomplete or suspicious, use the artifacts in this order:
- Open summary.json
  - Filter for failed results: entries where status == "failed".
  - Note each failing scope, summary, and exit_code.
- Open the corresponding per-scope log
  - Convert the scope to a filename the same way the script does:
    - packages/foo/ → agent_packages_foo.log
    - first retry → agent_packages_foo_retry1.log
    - later retries increment the number (agent_packages_foo_retry2.log, etc.)
  - Check:
    - --- stdout --- for malformed/non-contract output
    - --- stderr --- for runtime/tool errors
    - exit= to confirm whether it was a tool failure vs. a contract failure
- Differentiate failure types
  - Contract extraction failure: no final agent message, a malformed payload, or a non-object payload.
  - Tool failure: a nonzero exit code, with the underlying error in stderr.
The main execution block ties everything together with proper artifact management:
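For example, the timestamped directory could be created like this (the run_ naming scheme is an assumption):

```python
from datetime import datetime, timezone
from pathlib import Path

def make_run_dir(base: str = "audit_artifacts") -> Path:
    """Create a timestamped directory so runs never overwrite each other."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    run_dir = Path(base) / f"run_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```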
We create a timestamped output directory, ensuring each run produces independent artifacts that do not overwrite previous results. This design supports historical analysis: you can compare quality reports across time to track improvements or regressions.
The list comprehension executes agents sequentially; you can replace it with a parallel pattern (as in Lesson 2) without changing the artifact or gating model.
After orchestration completes, we validate overall quality:
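A sketch of this gate, assuming the uniform result records described earlier:

```python
import sys

FAILURE_THRESHOLD = 0.30  # halt when more than 30% of agents fail

def enforce_failure_gate(results: list[dict],
                         threshold: float = FAILURE_THRESHOLD) -> None:
    """Exit nonzero when the failure rate exceeds the threshold."""
    failed = sum(1 for r in results if r.get("status") == "failed")
    rate = failed / len(results) if results else 0.0
    if rate > threshold:
        # Report both the fraction and the percentage for clarity.
        sys.exit(f"Failure gate tripped: {failed}/{len(results)} agents failed "
                 f"({rate:.0%} > {threshold:.0%})")
```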
We count how many agents failed, compute the failure rate, and halt if that rate exceeds our threshold (30% by default). The error message includes both the fraction and the computed percentage to make it obvious that the gate is ratio-based.
We have built a production-grade orchestration system that embodies the principles explored throughout this course. The script combines contract-based subagents, failure containment with retries and defaults, artifact-first observability (logs + summaries), and failure-rate gating to prevent unreliable outputs from being trusted.
The patterns here scale beyond quality audits. You can adapt this architecture for test generation across packages, security scanning, documentation validation, or any task that benefits from concurrent analysis with centralized reporting. The key principles remain: instrument everything, log comprehensively, retry intelligently, aggregate systematically, and validate rigorously.
Now, it is your turn to bring these concepts to life! The upcoming practice exercises will challenge you to extend this capstone with additional reliability mechanisms, experiment with different failure strategies, and build your own multi-agent orchestration systems from the ground up.
