Pausing and Resuming Agents Through API Calls

Introduction: Adding Lifecycle Control to Agent APIs

In the previous lesson, you implemented database persistence and progress callbacks that save the agent's state after each step. This allows clients to monitor progress in real time, but they still have no control over the agent once it starts running. Now it's time to complete Factor 6 — Launch / Pause / Resume with simple APIs. You already have the launch endpoint from lesson one and the durable persistence layer from lesson two; in this lesson, you add the pause and resume endpoints that give users the power to stop expensive computations temporarily or correct mistakes without losing progress. You will implement these endpoints using a cooperative pause mechanism that lets agents stop gracefully between steps.

How the Pause Mechanism Works

Unlike forcefully terminating a thread or process, a cooperative pause lets the agent finish its current step before stopping. This approach works by having the pause endpoint change the database status field to "paused" and requiring the agent's progress callback to check for this status change after each step completes.

When the callback detects that the status has been changed externally, it updates the local state object to reflect the pause, causing the agent's main loop to exit on its next iteration. The advantage of this design is that it never interrupts the agent mid-step, ensuring the agent's state remains consistent and valid, with all fields properly synchronized. This is exactly what Factor 6 calls for: the agent can checkpoint, wait, and resume reliably across time.

Modifying the Progress Callback to Detect Pause Requests

As a reminder from the previous lesson, the progress callback is called after each agent step to save the current state to the database. To support pause detection, you need to modify this callback to check whether the database status was changed to "paused" by an external request:

The modified callback first queries the database to load the current record and then checks whether db_state.status equals "paused". If it does, the callback recognizes that an external request changed the status and updates the local state.status to "paused" as well. This status change causes the agent's main loop condition, while state.status == "running", to evaluate to False on the next iteration, ending the workflow.
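The callback and loop interaction described above can be sketched without any framework. This is a minimal illustration, not the lesson's actual code: the FAKE_DB dictionary stands in for the real database session, and the State fields, run_agent, and max_steps are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Assumption: a plain dict stands in for the database table of agent states.
FAKE_DB = {}


@dataclass
class State:
    id: str
    status: str = "running"
    steps: list = field(default_factory=list)


def progress_callback(state: State) -> None:
    """Save the latest step and honor a pause requested from outside."""
    db_state = FAKE_DB[state.id]
    # If an external request flipped the stored status, mirror it locally
    # so the loop condition fails on the next check.
    if db_state["status"] == "paused" and state.status == "running":
        state.status = "paused"
    db_state["steps"] = list(state.steps)
    db_state["status"] = state.status


def run_agent(state: State, max_steps: int = 5) -> State:
    """Hypothetical agent loop: one unit of work per iteration."""
    while state.status == "running":
        state.steps.append(f"step {len(state.steps) + 1}")
        if len(state.steps) >= max_steps:
            state.status = "complete"
        progress_callback(state)  # called after every step
    return state
```

Because the check happens only inside the callback, the step in progress always finishes and is saved before the loop exits.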

Implementing the Pause Endpoint

With the progress callback ready to detect pause requests, you can create the pause endpoint that sets the signal. First, define a Pydantic model for the pause request payload that contains only the id field, as that is all you need to identify which agent to pause:

This simple model validates that incoming requests include the required id field, following the same pattern you used for the launch request in previous lessons.
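A request model of this shape might look like the following sketch. The class name PauseRequest appears later in this lesson; the type of the id field is an assumption (a string is used here), so adjust it to match your own table's key type.

```python
from pydantic import BaseModel, ValidationError


class PauseRequest(BaseModel):
    """Request body for the pause endpoint."""

    # Assumption: state ids are strings; switch to int if your keys are integers.
    id: str


# Usage: a body without the required field is rejected before the handler runs.
try:
    PauseRequest()
except ValidationError:
    print("missing id rejected")
```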

Creating the Pause Route Handler

Now, you can implement the endpoint handler that validates the request and sets the pause signal in the database:

The endpoint opens a database session and queries for the record matching the provided id, returning a 404 error if no such state exists. Before setting the pause status, the endpoint validates that the agent is actually running — you cannot pause an agent that is already complete, failed, or paused. This validation prevents confusing state transitions and provides clear error messages to the client. If validation passes, the endpoint sets db_state.status = "paused", and the context manager commits the change when the block exits. The progress callback running in the background task will then detect this upon its next invocation.
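The validation order described above can be illustrated with a framework-free sketch. Everything here is a stand-in: FAKE_DB replaces the database session, ApiError replaces the web framework's HTTP error response, and the 400 status code for a non-running agent is an assumption, since only the 404 case is specified.

```python
# Assumption: a plain dict stands in for the database; keys are state ids.
FAKE_DB = {
    "agent-1": {"status": "running"},
    "agent-2": {"status": "complete"},
}


class ApiError(Exception):
    """Hypothetical stand-in for an HTTP error response."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def pause_agent(state_id: str) -> dict:
    """Set the pause signal after checking that the agent can be paused."""
    db_state = FAKE_DB.get(state_id)
    if db_state is None:
        raise ApiError(404, f"State {state_id} not found")

    status = db_state["status"]
    if status != "running":
        # Assumption: 400 is used for complete, failed, or already paused agents.
        raise ApiError(400, f"Cannot pause an agent with status {status}")

    # Only the signal is written here; the running agent notices it
    # the next time its progress callback fires.
    db_state["status"] = "paused"
    return {"id": state_id, "status": "paused"}
```

Note that the handler never touches the running task itself. It only writes the signal, which keeps the endpoint fast and leaves the stopping point up to the agent.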

Adapting the Background Runner for Resume Support

To support resuming paused workflows, you need to modify the background task function to accept an optional pre-loaded state. This allows the resume endpoint to load the state once and pass it to the background task, rather than requiring the task to load it again:

The function _run_agent_in_background now accepts working_state as an Optional[State] parameter. When it is None, the function loads the state from the database just as before, converting it to a Pydantic model and setting the status to "running". When working_state is provided (meaning that this is a resume operation), the function uses that state directly but still opens a database session to ensure the database status is set to "running". The context manager commits this automatically, which prevents race conditions where the database might still show "paused" while the agent is actually executing. The rest of the function remains unchanged from the previous lesson.

Implementing the Resume Endpoint

With the background runner adapted to accept pre-loaded states, you can implement the resume endpoint. First, define the request model that only needs the state id:

This follows the same simple pattern as the PauseRequest, keeping the API interface consistent and straightforward.

Creating the Resume Route Handler

Now, implement the endpoint handler that validates the state and restarts the background task:

The endpoint loads the state from the database and performs validations before resuming. It checks whether the agent is already running and returns a 409 Conflict status if it is, which prevents the accidental start of two concurrent executions for the same workflow. After validation passes, the endpoint sets the status to "running" in the database, converts the record to a Pydantic model, and schedules the background task using background_tasks.add_task with both the state id and the pre-loaded working_state.
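The resume flow can be sketched in the same framework-free style. The names here are stand-ins: FAKE_DB replaces the database, ApiError replaces the HTTP error response, the add_task argument plays the role of background_tasks.add_task, and run_in_background is a placeholder for the real background runner. The 404 response for an unknown id is an assumption; only the 409 case is specified.

```python
from dataclasses import dataclass

# Assumption: a plain dict stands in for the database table of agent states.
FAKE_DB = {"agent-1": {"status": "paused", "step": 2}}


@dataclass
class State:
    id: str
    status: str
    step: int = 0


class ApiError(Exception):
    """Hypothetical stand-in for an HTTP error response."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code


def run_in_background(state_id: str, working_state: State) -> None:
    """Placeholder for the real background runner."""


def resume_agent(state_id: str, add_task) -> dict:
    record = FAKE_DB.get(state_id)
    if record is None:
        raise ApiError(404, f"State {state_id} not found")  # assumed status code
    if record["status"] == "running":
        # Refuse to start a second execution of the same workflow.
        raise ApiError(409, f"Agent {state_id} is already running")

    record["status"] = "running"
    working_state = State(id=state_id, status="running", step=record["step"])
    # Hand the pre-loaded state to the runner so it is not loaded twice.
    add_task(run_in_background, state_id, working_state)
    return {"id": state_id, "status": "running"}
```

Marking the record as running before scheduling the task means a second resume request arriving immediately afterward hits the 409 branch instead of launching a duplicate run.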

Testing Launch and Pause

With both endpoints implemented, you can test the pause flow using a test script that launches an agent and pauses it shortly after:

Running this script produces output showing the agent launching and then pausing successfully:

The agent was successfully paused before it had completed any steps, though it may have been in the middle of processing its first LLM call when the pause signal arrived. The progress callback detected the pause status during the save operation of the first step, causing the agent loop to exit immediately.

Testing Resume and Monitoring Completion

Now you can continue with the same agent by resuming it and polling until completion:

Running this continuation produces output showing the agent resuming and progressing toward completion:

When the resume endpoint was called, the agent picked up from its saved state at step zero, since the pause had arrived before any step completed, and continued working through the problem, eventually reaching completion after six total steps. The final answer matches what you would expect for the quadratic equation, confirming that the pause and resume cycle did not disrupt the agent's reasoning or cause it to lose context.

Summary and Next Steps

You have now fully implemented Factor 6 — Launch / Pause / Resume with simple APIs. Your agent's lifecycle is modeled explicitly through three endpoints: /agent/launch, /agent/pause, and /agent/resume. The cooperative pause mechanism ensures agents always stop gracefully between steps, while the resume endpoint validates state and restarts execution from exactly where it stopped. In the next lesson, you'll implement Factor 7 — Contact humans with tool calls, enabling agents to proactively request information from users when they need it, treating human escalation as a first-class tool with structured requests and recorded outputs.
