
How to Replay and Debug Failed AI Agent Runs Step by Step

A step-by-step guide to replaying and debugging failed AI agent runs. Learn the workflow, common failure patterns, and how to set up your agents for debuggability.

AI agents · debugging · replay · LLM tracing


If you have ever shipped an AI agent into production, you already know the sinking feeling. A user reports that your agent did something completely wrong, you pull up the logs, and all you see is a wall of unstructured text that tells you almost nothing about what actually happened. The ability to replay and debug failed AI agent runs is not a luxury. It is the single most important capability you need to build reliable agent systems. Without it, you are flying blind every time something breaks.

AI agents are fundamentally different from traditional software. A standard API endpoint takes an input, runs deterministic logic, and returns an output. When it fails, the stack trace usually points you directly to the problem. Agents, on the other hand, make a series of decisions. They choose which tools to call, they interpret results, they decide what to do next based on context that shifts with every step. A failure at step seven might have its root cause at step two, buried inside a tool call response that subtly changed the agent's reasoning path.

This is why simple logging is not enough. You need the ability to capture every decision point, visualize the entire execution path, and then replay that exact sequence to understand what went wrong. Replay is not just "reading the logs again." It means rerunning the exact inputs through the same prompt chain, or a modified one, so you can isolate the failure and test your fix before pushing it to production.

In this guide, we will walk through the entire process of debugging failed agent runs. We will cover why agent failures are uniquely difficult, what replay actually means in this context, a step-by-step debugging workflow, common failure patterns to watch for, and how to set up your agent for maximum debuggability from the start.

Why AI Agent Failures Are So Hard to Debug

Traditional software debugging follows a predictable pattern. You get an error, you read the stack trace, you find the line of code that broke, and you fix it. Agent debugging breaks every part of that pattern. Here is why.

Non-determinism is the default. Even with temperature set to zero, large language models do not always produce identical outputs for identical inputs. Minor differences in tokenization, model version updates, or even server-side batching can cause the model to take a different reasoning path. This means that simply rerunning the same request might not reproduce the bug at all.

Multi-step chains create combinatorial complexity. An agent that makes five tool calls in sequence has a failure surface that grows exponentially. The failure might be in the first tool call, in the model's interpretation of the first tool call's result, in the decision to make the second tool call instead of the third, or in any combination of these. You cannot debug step five without understanding steps one through four.

Tool calls introduce external state. When your agent calls a database, an API, or a search engine, the result depends on external state that may have changed since the original run. The database row might have been updated. The API might return different results. The search index might have been refreshed. This makes reproduction even harder.

Context window dynamics are invisible. As an agent accumulates conversation history, tool results, and intermediate reasoning, the context window fills up. At some point, important information gets truncated or the model starts losing track of earlier instructions. This kind of failure is almost impossible to spot from flat logs because you need to see exactly what the model's input looked like at each step.

Error messages from LLMs are often misleading. When a traditional function fails, the error message typically describes the actual problem. When an LLM fails, it might confidently produce incorrect output with no error at all. Or it might produce a vague refusal that does not explain the underlying issue. The failure mode is not "crash" but "wrong behavior," which is far harder to detect and diagnose.

What Replay Actually Means for AI Agents

Let us be precise about terminology because "replay" gets used loosely in the observability space. Reading a log file is not replay. Watching a recording of what happened is not replay. True replay means taking the exact captured inputs from a failed run and feeding them back through your agent pipeline so you can observe the execution in a controlled environment.

For AI agents specifically, replay involves several distinct capabilities. First, you need the complete trace: every prompt sent to the model, every response received, every tool call made, and every tool result returned. Second, you need the ability to rerun that trace through the same model configuration (or a modified one) without needing to recreate the external conditions that existed at the time of the original run. Third, you need to be able to modify a single step and see how the downstream behavior changes.
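The capture requirements above can be sketched as a data structure. This is a minimal illustration of what one step of a complete trace holds, with hypothetical field names rather than any particular SDK's schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str          # the raw result the model saw, untruncated

@dataclass
class TraceStep:
    # Everything the model saw and produced at this step.
    prompt_messages: list          # full message list, including the system message
    response: str                  # the model's complete output
    tool_calls: list = field(default_factory=list)  # ToolCall records, if any
    model: str = ""                # model version, needed for faithful replay
    temperature: float = 0.0

# A full trace is just the ordered list of steps; replay walks this list.
trace = [
    TraceStep(
        prompt_messages=[
            {"role": "system", "content": "You are a helpful agent."},
            {"role": "user", "content": "What is the status of order 42?"},
        ],
        response="I will look that up.",
        tool_calls=[ToolCall("get_order", {"order_id": 42}, '{"status": "shipped"}')],
        model="gpt-4o",
    )
]
```

With every step stored this way, rerunning step N means replaying `trace[:N]` verbatim and only letting the model generate from step N onward.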

This last point is critical. The real power of replay is not just seeing what happened. It is asking "what if." What if I change the system prompt at step three? What if I swap the tool result at step five with a corrected version? What if I add a guardrail that prevents the agent from making that particular tool call? Replay turns debugging from a passive, forensic activity into an active, experimental one.

Think of it like a chess analysis board. You do not just review the game. You go back to the move where things went wrong, try a different move, and see how the rest of the game plays out. That is what agent replay gives you: the ability to branch from any point in a failed run and explore alternative paths.

Step-by-Step: How to Debug a Failed Agent Run

Here is a concrete, repeatable workflow for debugging a failed agent run. Follow these steps in order.

  1. Capture the full trace. Before you can debug anything, you need complete observability into the run. This means instrumenting your agent to capture every LLM call, every tool invocation, every input and output at each step. If you are using a tracing tool like Glassbrain, this happens automatically with a one-line SDK integration. If you are building your own, make sure you log the full prompt (including system message), the model's response, any tool calls with their arguments, and the tool results.
  2. Open the visual trace tree. Flat logs are nearly useless for multi-step agent runs. You need a visual representation that shows the execution as a tree or graph, with each node representing a step in the agent's reasoning. This lets you see the overall structure of the run at a glance: how many steps it took, which tools were called, where the branching happened, and where the failure occurred.
  3. Find the broken node. Start from the end (the incorrect output or error) and work backward through the trace tree. At each node, ask: "Was the input to this step correct?" and "Was the output from this step correct?" The broken node is the first one where the output diverges from what you expected, given a correct input. Sometimes the failure is obvious (a tool returned an error). Sometimes it is subtle (the model interpreted a correct tool result incorrectly).
  4. Inspect the full prompt and response. Once you have identified the suspicious node, examine the complete prompt that was sent to the model at that step. Look at the system message, the conversation history, any injected tool results, and the model's full response. Pay attention to how much of the context window was used. Check whether important instructions were present or had been truncated.
  5. Identify the root cause. With the full context visible, categorize the failure. Did the model hallucinate a function argument? Did it pick the wrong tool? Did a tool return unexpected data? Did the context window overflow and lose critical instructions? Did a previous step's output corrupt the reasoning chain? The root cause is often two or three steps before the visible failure.
  6. Replay with a fix. Now modify the trace at the point of failure. If the system prompt was unclear, rewrite it. If a tool result was malformed, correct it. If the model chose the wrong tool, add a constraint. Then replay the run from that point forward to see if your fix resolves the issue without introducing new problems.
  7. Verify across similar traces. A fix that works for one failure might break other cases. Search your trace history for similar runs (same tools, same user intent, same failure pattern) and verify that your fix handles those cases correctly too.
  8. Deploy with confidence. Once you have verified the fix across multiple traces, push it to production. Continue monitoring the relevant traces to confirm the fix holds under real-world conditions.
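Step 3, working backward to the first divergent node, can be mechanized when each captured step carries its output and you can supply a per-step correctness check. A sketch, assuming the corruption is contiguous (everything downstream of the broken step is also wrong, which matches the failure model described above):

```python
def find_broken_step(steps, is_output_ok):
    """Return the index of the first step whose output diverges,
    scanning from the end back toward the start.

    steps: ordered list of captured step outputs.
    is_output_ok: predicate that judges a single step's output.
    Returns None if every step looks correct.
    """
    broken = None
    # Walk backward: the last bad step we encounter going backward is the
    # FIRST bad step in execution order, i.e. the root of the divergence.
    for i in range(len(steps) - 1, -1, -1):
        if is_output_ok(steps[i]):
            break        # everything before this point looked correct
        broken = i
    return broken

# Example: the visible failure is at step 3, but the divergence began at step 2.
outputs = ["ok", "ok", "wrong tool", "garbage answer"]
print(find_broken_step(outputs, lambda o: o == "ok"))  # -> 2
```

In practice the predicate is you, clicking through nodes in a trace viewer, but the search order is the same: the broken node is the earliest one with a correct input and an incorrect output.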

Common Agent Failure Patterns and How to Spot Them

Debug enough agent failures and certain patterns start to repeat. Here are the five most common ones and how to identify them in your traces.

Infinite Tool Call Loops

The agent calls the same tool repeatedly with the same or slightly varied arguments, never making progress. In a trace tree, this looks like a long, unbranching chain of identical nodes. The root cause is usually a tool that returns ambiguous results, causing the model to retry without changing its approach. Fix this by adding a maximum retry count or by improving the tool's error messages to give the model actionable feedback.
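One way to enforce the retry cap is to track identical (tool, arguments) pairs and abort once a threshold is hit. A sketch, with an illustrative threshold and error message:

```python
import json

class ToolLoopGuard:
    """Abort when the agent repeats the same tool call too many times."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = {}

    def check(self, tool_name, arguments):
        # Canonicalize arguments so {"a": 1, "b": 2} and {"b": 2, "a": 1} match.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] > self.max_repeats:
            raise RuntimeError(
                f"Tool '{tool_name}' called {self.counts[key]} times with "
                "identical arguments; breaking the loop."
            )

guard = ToolLoopGuard(max_repeats=2)
guard.check("search", {"q": "order 42"})   # 1st call: fine
guard.check("search", {"q": "order 42"})   # 2nd call: fine
# A 3rd identical call would raise RuntimeError instead of looping forever.
```

Catch the exception in your agent loop and feed it back to the model as a tool error; that usually forces a change of approach rather than another blind retry.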

Wrong Tool Selection

The agent has access to multiple tools and picks the wrong one for the task. This often happens when tool descriptions are too similar or too vague. In the trace, you will see a tool call that makes no sense given the user's request. The fix is usually to rewrite tool descriptions to be more specific about when each tool should and should not be used.

Context Window Overflow

As the agent accumulates history, earlier messages get pushed out of the context window. Critical instructions from the system prompt or early user messages disappear. In the trace, you can spot this by checking the token count at each step and comparing the actual prompt content to what it should contain. The fix involves summarizing intermediate results or using a sliding window strategy.
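A minimal sliding-window strategy keeps the system message pinned and drops the oldest turns first. This sketch uses a rough characters-per-token estimate; real token counting needs the model's tokenizer:

```python
def trim_context(messages, max_tokens=1000, chars_per_token=4):
    """Keep the system message plus as many recent messages as fit.

    messages: list of {"role": ..., "content": ...} dicts, system message first.
    Token cost is estimated as len(content) / chars_per_token, which is only
    a rough proxy; use the model's tokenizer for exact counts.
    """
    def cost(m):
        return len(m["content"]) // chars_per_token + 1

    system, rest = messages[0], messages[1:]
    budget = max_tokens - cost(system)
    kept = []
    # Walk newest-to-oldest so the most recent turns survive truncation.
    for m in reversed(rest):
        if cost(m) > budget:
            break
        kept.append(m)
        budget -= cost(m)
    return [system] + list(reversed(kept))
```

The key property to verify in your traces is the pinning: after trimming, the system message must still be present verbatim. Silent loss of the system message is exactly the overflow failure described above.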

Hallucinated Function Arguments

The model invents arguments that do not match the tool's schema. It might pass a string where an integer is expected, use a field name that does not exist, or fabricate an enum value. In the trace, compare the tool call arguments against the tool's schema definition. Stricter schema validation and more explicit parameter descriptions help prevent this.
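Catching hallucinated arguments early is a type-and-field check against the tool's schema before execution. A hand-rolled sketch (libraries such as jsonschema or Pydantic do this more thoroughly, including enums and nested objects):

```python
def validate_args(args, schema):
    """Check tool-call arguments against a minimal schema.

    schema: {"field_name": expected_type, ...}; every field is required.
    Returns a list of human-readable problems (empty means valid).
    """
    problems = []
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing required field '{name}'")
        elif not isinstance(args[name], expected):
            problems.append(
                f"field '{name}' should be {expected.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    for name in args:
        if name not in schema:
            problems.append(f"unknown field '{name}'")
    return problems

schema = {"order_id": int, "include_items": bool}
# The model passed a string where an int was expected and invented a field.
print(validate_args({"order_id": "42", "verbose": True}, schema))
```

When validation fails, return the problem list to the model as the tool result instead of executing the call; an explicit "field 'order_id' should be int" message gives the model something concrete to correct.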

Retrieval Returning Irrelevant Results

When the agent uses a retrieval tool (RAG), the retrieved documents might be irrelevant to the query. The model then reasons over bad data and produces wrong answers. In the trace, inspect the retrieval step's output and assess whether the returned documents actually answer the question. Fixes include improving your embedding model, adding re-ranking, or filtering by metadata.

Setting Up Your Agent for Debuggability

The best time to make your agent debuggable is before the first failure. Here is how to instrument your agent so that when something breaks, you have everything you need to diagnose it quickly.

Instrument every tool call. Every tool your agent can invoke should log its inputs and outputs as part of a structured trace. Do not just log the tool name. Log the full arguments, the full response, the latency, and any errors. This is your primary debugging data.

Capture the complete prompt at every step. The prompt is not just the user's message. It includes the system message, conversation history, injected context, tool results, and any other content that the model sees. Log the entire thing. When debugging, the difference between what you think the model saw and what it actually saw is often the root cause.

Log tool results faithfully. Do not truncate or summarize tool results in your logs. The model saw the full result, and you need to see exactly what the model saw. If a tool returned 500 lines of JSON, log all 500 lines. Disk space is cheap. Debugging time is not.

Use structured traces, not flat logs. A structured trace captures the parent-child relationships between steps. You can see that tool call B happened because of decision A, and that the model's final response was based on the result of tool call C. Flat logs lose this structure entirely, forcing you to reconstruct it manually every time you debug.
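The difference is easy to see in data: flat events become a tree the moment each event carries its parent's id. An illustrative structure, not any particular SDK's wire format:

```python
def build_trace_tree(events):
    """Turn a flat list of {"id", "parent", ...} events into a nested tree.

    Events with parent=None are roots; children keep their original order.
    Assumes every parent id appears in the list (true for a complete trace).
    """
    nodes = {e["id"]: {**e, "children": []} for e in events}
    roots = []
    for e in events:
        node = nodes[e["id"]]
        if e["parent"] is None:
            roots.append(node)
        else:
            nodes[e["parent"]]["children"].append(node)
    return roots

flat = [
    {"id": "run",   "parent": None,   "name": "agent_run"},
    {"id": "llm1",  "parent": "run",  "name": "decide_tool"},
    {"id": "toolA", "parent": "llm1", "name": "search_db"},
    {"id": "llm2",  "parent": "run",  "name": "final_answer"},
]
tree = build_trace_tree(flat)
# tree now encodes that search_db happened *because of* decide_tool,
# which a flat log would force you to reconstruct by hand.
```

Emitting a parent id per event costs almost nothing at logging time and is what makes causal questions ("why did this tool call happen?") answerable later.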

Make tracing a default, not an opt-in. If tracing requires extra effort, developers will skip it in prototypes and forget to add it later. Use an SDK that makes tracing automatic. With tools like Glassbrain, a single line of code instruments all your LLM calls. There is no excuse for shipping an unobservable agent.

How Glassbrain Makes Agent Replay Simple

Glassbrain is purpose-built for debugging AI agent runs. Here is how it streamlines the replay and debugging workflow described in this guide.

One-line SDK integration. Install the JavaScript SDK (glassbrain-js) or the Python SDK (glassbrain) and wrap your LLM client with wrapOpenAI, wrap_openai, or wrap_anthropic. Every LLM call, tool invocation, and agent step is automatically captured as a structured trace. No manual instrumentation required.

Visual interactive trace tree. Every trace is displayed as an interactive graph, not a flat list of log lines. You can see the full agent execution path, click into any node to inspect its prompt and response, and immediately spot where the reasoning went wrong. The tree structure makes it obvious which steps are parents, which are children, and how the data flows between them.

Built-in replay with no API keys required. Click any node in the trace tree and replay it directly from the dashboard. Glassbrain uses server-side keys, so you do not need to configure or expose your own API credentials. Modify the prompt, change a tool result, and rerun to see how the agent's behavior changes.

AI-powered fix suggestions. For every failed trace, Glassbrain automatically analyzes the failure and suggests potential fixes. These suggestions are based on the actual trace data, not generic advice. They point you to the specific node that broke and explain why.

The free tier includes 1,000 traces per month with no credit card required. That is enough to debug most development and staging workflows before you ever need to think about pricing.

Frequently Asked Questions

Can I replay an agent run without reproducing the exact external conditions?

Yes. The key to effective replay is capturing the complete trace at the time of the original run, including all tool inputs and outputs. When you replay, you use the captured tool results rather than calling the actual external services again. This means you do not need the database to be in the same state, the API to return the same response, or the search index to have the same content. The trace contains everything needed for replay. Tools like Glassbrain capture this data automatically, so replay works out of the box.
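Mechanically, this means the replay harness answers tool calls from the recorded trace instead of hitting live services. A sketch, with an illustrative record format:

```python
import json

class RecordedTools:
    """Serve tool results from a captured trace instead of live services."""

    def __init__(self, recorded_calls):
        # Map each (tool, canonical args) pair to its recorded result.
        self.results = {
            (c["tool"], json.dumps(c["args"], sort_keys=True)): c["result"]
            for c in recorded_calls
        }

    def call(self, tool, args):
        key = (tool, json.dumps(args, sort_keys=True))
        if key not in self.results:
            # The replayed agent diverged: it made a call the original run
            # never made, so there is nothing recorded to return for it.
            raise KeyError(f"no recorded result for {tool}({args})")
        return self.results[key]

recorded = [
    {"tool": "get_order", "args": {"order_id": 42},
     "result": '{"status": "shipped"}'}
]
tools = RecordedTools(recorded)
print(tools.call("get_order", {"order_id": 42}))  # captured result, no live DB needed
```

The divergence error is a feature, not a bug: when a replayed run requests a call the original never made, that is itself a signal that your modification changed the agent's path at that point.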

How is replaying an agent run different from rerunning it?

Rerunning means sending the same initial input through your agent and letting it execute from scratch. Because LLMs are non-deterministic and external services may have changed, the rerun might follow a completely different path and never reproduce the original failure. Replay, by contrast, uses the captured trace data to reconstruct the exact sequence of events. You can step through the same decisions, inspect the same outputs, and then selectively modify specific steps to test fixes. Replay is controlled and reproducible. Rerunning is a roll of the dice.

What data do I need to capture to enable replay later?

At minimum, you need the full prompt (including system message and all injected context) sent to the model at every step, the model's complete response at every step, every tool call with its full arguments, every tool result with its full response, and metadata like model version, temperature, and token counts. If you are missing any of these, you will have gaps in your replay that force you to guess what happened. Structured tracing SDKs capture all of this automatically.

How do I debug an agent failure that only happens intermittently?

Intermittent failures are the hardest to debug because you cannot reliably reproduce them on demand. The solution is to have comprehensive tracing running at all times, even in production. When the intermittent failure occurs, you will have the complete trace ready for analysis. Search your trace history for runs with similar inputs and compare the successful ones to the failed ones. The difference is usually in a tool result that varied, a context window that was slightly longer, or a model response that took a different reasoning path. With enough traces, intermittent failures become reproducible patterns.

Should I trace every agent run in production, or only sample a percentage?

Trace every run. Sampling means you will miss failures, and failures are exactly the runs you need traces for. The cost of storing trace data is negligible compared to the cost of debugging a production incident without observability. If you are concerned about performance overhead, modern tracing SDKs add minimal latency (typically under 5 milliseconds per step). The Glassbrain free tier gives you 1,000 traces per month at no cost, and paid plans scale to handle full production traffic without sampling.


Replay failed agent runs in one click.

Try Glassbrain Free