
LLM Tracing Explained: How to Debug Prompts in Production

What LLM tracing is, how it works, and how to use it to debug AI apps in production. A clear guide for developers building with OpenAI, Anthropic, and more.

Tags: LLM tracing · AI debugging · prompts · production

LLM Tracing Explained: The Developer Guide

LLM tracing is the practice of recording every step your application takes when it talks to a large language model, then visualizing those steps as a connected tree so you can see exactly what happened, in what order, and why. If you have ever stared at a broken AI feature wondering whether the bug was in your prompt, your retrieval step, your tool call, or the model itself, you have felt the exact pain that LLM tracing is designed to solve. Traditional logs tell you that something happened. A good LLM trace tells you the full story: the assembled prompt, the raw response, the tokens consumed, the latency of each hop, the model version, and any errors or refusals along the way.

In this guide we will walk through what LLM tracing actually is, what a high-quality trace should capture, how tracing works under the hood, and how to use it to debug real bugs in production AI apps. We will also compare LLM tracing to traditional APM tools, highlight common mistakes teams make when they first start tracing LLM calls, and show you how to add tracing to an existing OpenAI or Anthropic project in a single line of code. Whether you are building a chatbot, an agent, a RAG pipeline, or a multi-step workflow, this is the mental model you need.

What Is an LLM Trace?

An LLM trace is a structured record of a single logical operation in your AI application, broken into smaller units called spans. A span represents one discrete piece of work: a call to OpenAI, a vector search, a tool invocation, a database query, a post-processing step. Each span has a start time, an end time, a name, and a bag of metadata. When one span starts work that triggers another span, we say the second span is a child of the first. That parent-child relationship is what turns a flat list of events into a tree.

The trace tree is the heart of LLM tracing. At the root you have the top-level request, say, a handle_user_message function. Underneath it you might have a retrieval span, a prompt assembly span, an OpenAI completion span, and a tool call span. The tool call span might itself have children: an HTTP request, a database lookup, a second LLM call to summarize the result. When you look at the tree, you can immediately see the shape of the work your app did and where time was spent.

Metadata is what makes an LLM trace useful rather than just structurally interesting. Good tracing libraries attach the prompt, the response, token counts, model name, temperature, latency, and any error state to each span. That way, when you click into a node in the tree, you are not just seeing that an OpenAI call happened, you are seeing exactly what was sent and what came back. This is the fundamental difference between LLM tracing and ordinary logging.

What a Good LLM Trace Captures

Not all tracing is created equal. A trace that only tells you "OpenAI call took 2.3 seconds" is almost useless when you are trying to debug. Here is what a high-quality LLM trace should capture on every span.

The Full Prompt

The single most important thing an LLM trace must capture is the full, assembled prompt that was actually sent to the model. Not the template, not the user message in isolation, but the final string (or message array) after all your variables, retrieved context, system instructions, few-shot examples, and conversation history have been stitched together. Most LLM bugs are prompt assembly bugs: a missing variable, a stale context window, a retrieved chunk that got truncated, a system message that was accidentally dropped. If your tracing tool does not show you the exact bytes the model saw, you will spend hours guessing. Glassbrain stores the full prompt on every span so you can inspect it the way the model did.
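As a minimal sketch of the idea (the function and field names here are hypothetical), the point is to capture the final message array your assembly function produces, not the template and variables separately:

```python
SYSTEM = "You are a support assistant for {product}."

def build_messages(user_msg, product, history, context_chunks):
    """Assemble the final message array the model will actually see."""
    messages = [{"role": "system", "content": SYSTEM.format(product=product)}]
    messages += history
    if context_chunks:
        messages.append({"role": "user",
                         "content": "Context:\n" + "\n".join(context_chunks)})
    messages.append({"role": "user", "content": user_msg})
    return messages

captured = {}  # stands in for span metadata storage

def traced_call(user_msg, **kwargs):
    messages = build_messages(user_msg, **kwargs)
    # Record the assembled prompt on the span, not the template + vars.
    captured["prompt"] = messages
    # ... forward `messages` to the model here ...
    return messages

traced_call("My export is failing",
            product="Acme", history=[],
            context_chunks=["Docs: exports run nightly"])
```

If a variable is missing or a retrieved chunk is truncated, it shows up in `captured["prompt"]` exactly as the model saw it.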

The Full Response

Just as important is the full response the model returned, before any of your parsing or post-processing logic touched it. This matters because a lot of bugs live in the gap between what the model said and what your code thought it said. Maybe the model returned valid JSON with an extra markdown fence. Maybe it refused the request politely and your parser interpreted the refusal as a successful answer. Maybe it hallucinated a field name. Capturing the raw response lets you replay your parsing logic against reality rather than against what you hoped reality looked like. A good LLM trace never throws away the raw output.
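A sketch of the pattern, using the markdown-fence case from above (the names are illustrative): record the raw output on the span first, then parse.

```python
import json

span_metadata = {}  # stands in for the span's metadata store

def parse_model_json(raw: str):
    """Record the raw response first, then parse it, stripping an optional
    markdown fence like ```json ... ``` that models sometimes add."""
    span_metadata["raw_response"] = raw  # never throw away the raw output
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

result = parse_model_json('```json\n{"answer": 42}\n```')
```

When parsing fails in production, `span_metadata["raw_response"]` lets you replay the parser against what the model actually said.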

Token Usage and Cost

Every span that calls a model should record the input tokens, output tokens, and total tokens, along with the derived cost in dollars. This is how you catch runaway context windows, accidental prompt bloat, and expensive tool loops. When you look at a trace tree and see a single span that burned fifty thousand input tokens, you know immediately where your bill is going. Token usage per span also lets you answer business questions like "how much does one customer conversation cost on average" without having to instrument billing separately. Good LLM tracing doubles as cost observability.
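Deriving dollar cost from recorded token usage is a small calculation. The prices below are placeholders; real per-token prices vary by model and change over time, so a real implementation would load them from a maintained table:

```python
# Illustrative per-million-token prices in USD -- placeholders, not real rates.
PRICES = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Derive the dollar cost of one span from its recorded token usage."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = span_cost("gpt-4o", input_tokens=50_000, output_tokens=1_000)
# At these placeholder rates, the 50,000 input tokens alone cost $0.125,
# which is exactly the kind of span that jumps out in a trace tree.
```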

Latency Per Step

Latency is the reason your users complain. A good LLM trace shows wall clock time for every span so you can see exactly which step is slow. Is it the retrieval? The first LLM call? A tool call that is hitting a cold database? A sequential chain that should have been parallel? When you visualize latency in a trace tree, slow steps jump out instantly because they are wider than their siblings. This is one area where LLM tracing borrows directly from the distributed tracing tradition, and the lesson transfers cleanly: measure every hop, show the waterfall, let the eye find the problem.

Model Version and Parameters

It is shockingly easy to ship a change in model version and forget about it. Your trace should record the exact model string returned by the API, the temperature, top_p, max_tokens, any tools you passed, and any response_format you requested. When a bug starts happening on Tuesday and you can see that a new model version started flowing through your traces on Monday evening, you have your answer in seconds instead of hours. This is also critical for reproducibility: without the exact parameters, you cannot replay a trace.

Errors, Refusals, and Tool Calls

Finally, a great LLM trace captures the full range of non-happy-path events: HTTP errors, rate limits, timeouts, content policy refusals, tool calls and their arguments, tool results, and any retries. Tool calls deserve special attention because they are often where agents go off the rails. Your trace should show the tool name, the arguments the model chose, the result that came back, and how the model reacted. Without this level of detail, debugging an agent feels like debugging a black box. With it, you can follow the agent's reasoning step by step.

How LLM Tracing Works Under the Hood

Under the hood, LLM tracing works by wrapping the functions you already call. When you install a tracing SDK and invoke something like wrapOpenAI, the library returns a thin proxy around the real OpenAI client. Every time your code calls chat.completions.create, the proxy starts a span, records the arguments, forwards the call to the real client, waits for the response, records the response and token usage, and ends the span. From your application's perspective, nothing has changed. From the tracing system's perspective, a fully instrumented event has just been captured.
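A stripped-down version of that proxy pattern looks roughly like this. The fake client below stands in for a real provider SDK, and the span shape is a simplification:

```python
import time

exported_spans = []  # a real SDK would hand these to an exporter

class FakeCompletions:
    """Stands in for the real client's chat.completions endpoint."""
    def create(self, **kwargs):
        return {"choices": [{"message": {"content": "hi"}}],
                "usage": {"prompt_tokens": 12, "completion_tokens": 3}}

def wrap_create(create_fn):
    """Return a proxy that records a span around every call."""
    def traced_create(**kwargs):
        span = {"name": "chat.completions.create",
                "request": kwargs, "start": time.time()}
        try:
            response = create_fn(**kwargs)
            span["response"] = response
            span["usage"] = response["usage"]
            return response
        finally:
            # The span is recorded even if the call raised.
            span["end"] = time.time()
            exported_spans.append(span)
    return traced_create

completions = FakeCompletions()
completions.create = wrap_create(completions.create)
completions.create(model="gpt-4o", messages=[{"role": "user", "content": "hi"}])
```

The calling code is unchanged; the span with the full request, response, usage, and timing appears as a side effect.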

Span propagation is the glue that makes the tree work. When a span starts, it stores itself in a context variable (async local storage in Node, contextvars in Python). Any span that starts while that context is active automatically becomes a child. This is how nested spans know their parent without you having to pass IDs around manually. It is also how tool calls inside an agent loop end up nested correctly under their parent completion.
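Here is a minimal illustration of that parenting mechanism using Python's contextvars, simplified to omit timing and export:

```python
import contextvars

# The currently active span, carried implicitly through the call stack.
current_span = contextvars.ContextVar("current_span", default=None)

class span:
    """Context manager that parents itself under whatever span is active."""
    def __init__(self, name):
        self.name, self.children = name, []

    def __enter__(self):
        self.parent = current_span.get()
        if self.parent is not None:
            self.parent.children.append(self)
        self._token = current_span.set(self)  # become the active span
        return self

    def __exit__(self, *exc):
        current_span.reset(self._token)  # restore the previous active span

with span("agent_loop") as root:
    with span("completion"):
        with span("tool_call"):
            pass  # nested spans find their parent automatically
```

No IDs are passed around: each span reads its parent from the context variable, which is why tool calls inside an agent loop nest correctly without any extra plumbing.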

Exporting is done asynchronously. The SDK batches completed spans in memory and ships them to the tracing backend on a timer or when the batch fills up. This is why tracing LLM calls does not slow down your app: the hot path never waits on the network. If the export fails, the SDK retries with backoff and eventually drops the batch rather than blocking your application. The Glassbrain SDK follows this pattern and will never crash your app if the backend is unreachable.
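The batching behavior can be sketched like this. A real exporter also flushes on a background timer and waits with backoff between retries; this simplified version only shows batch-on-size and drop-on-failure:

```python
class BatchExporter:
    """Buffers finished spans and ships them in batches (simplified sketch)."""

    def __init__(self, send, batch_size=3, max_retries=2):
        self.send = send              # callable that ships a batch
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.buffer = []

    def on_span_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        batch, self.buffer = self.buffer, []
        for _ in range(self.max_retries + 1):
            try:
                self.send(batch)
                return
            except OSError:
                continue  # backoff between retries elided
        # After exhausting retries, drop the batch rather than block the app.

shipped = []
exporter = BatchExporter(send=shipped.append, batch_size=3)
for i in range(7):
    exporter.on_span_end({"id": i})
# Two full batches shipped; one span still buffered awaiting the next flush.
```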

If you have used OpenTelemetry before, all of this will sound familiar. LLM tracing borrows the span model directly from OTel, but specializes it for AI workloads by adding first-class fields for prompts, responses, tokens, and model metadata. You can think of LLM tracing as OpenTelemetry with AI-native ergonomics.

How to Use LLM Tracing to Debug a Bug

Here is the workflow we see most teams settle into once they have LLM tracing in place. It is simple, repeatable, and dramatically faster than reading logs.

  1. An alert fires or a user complains. Something is wrong: the bot gave a weird answer, a workflow stalled, costs spiked, latency went through the roof. You have a user ID, a conversation ID, or a timestamp.
  2. Open the relevant trace. Jump into your tracing dashboard and filter by the identifier you have. You are now looking at a single LLM trace representing exactly what happened during that request.
  3. Scan the tree shape. Before clicking anything, look at the overall structure. Is there a span that is obviously too wide (slow)? A branch that looks shorter or longer than expected? A tool call that fired ten times in a loop? The shape of the tree often tells you the bug category within five seconds.
  4. Find the broken span. Drill into the suspect node. Errors are marked clearly. If there is no error, look for a span where the output does not match what the next span expected as input. That boundary is almost always where the bug lives.
  5. Click into the prompt and response. Now you are looking at the exact bytes the model saw and returned. Ninety percent of the time, the bug is now visible to the naked eye: a missing variable in the prompt, a retrieved chunk that is not what you expected, a model response that your parser mishandled, a tool argument that got mangled.
  6. Fix and replay. Make the fix in your code. Then, instead of waiting for the bug to recur in production, use the replay feature to run the exact same trace through your updated logic. Glassbrain includes replay built in, with no user API keys required, so you can verify your fix against the real captured inputs in seconds.

This loop (alert, open, scan, drill, fix, replay) is what LLM tracing unlocks. It turns AI debugging from a guessing game into a methodical inspection process. The first time you fix a gnarly prompt bug in under two minutes because you could see the exact assembled prompt in a trace, you will never want to go back to print statements.

LLM Tracing vs Traditional APM

Traditional application performance monitoring (APM) tools like Datadog APM, New Relic, and Honeycomb are excellent at what they were built for: measuring latency, error rates, and throughput across HTTP services and databases. They can technically trace LLM calls too, because an OpenAI request is just an HTTP request. But they were not designed with AI workloads in mind, and it shows in three ways.

First, they do not capture prompts and responses by default. An APM span for an OpenAI call will tell you the URL, the status code, and the duration, but not the actual prompt or completion. Without those, you cannot debug most LLM bugs. You can shove the prompt into a custom attribute, but most APMs truncate string attributes aggressively and charge you for cardinality.

Second, they do not model tokens or cost. You have to build that yourself, usually by parsing response bodies and emitting custom metrics. Purpose-built LLM tracing tools do this automatically for every provider.

Third, they do not understand the semantics of AI workflows: tool calls, agent loops, retries, refusals, streaming responses. A purpose-built tool for tracing LLM calls treats these as first-class concepts and renders them appropriately in the trace tree.

The short version: traditional APM answers "is my service healthy?" LLM tracing answers "why did the model say that?" You often want both, and they complement each other well.

Adding LLM Tracing to Your App in One Line

The reason LLM tracing has taken off is that modern SDKs make it genuinely trivial to add. With Glassbrain, you install the SDK, grab an API key from the dashboard, and wrap your existing client. In JavaScript you call wrapOpenAI on a new OpenAI instance. In Python it is wrap_openai or wrap_anthropic around the provider client. That is the entire integration. Every call you already make is now traced, with prompts, responses, tokens, cost, and latency captured automatically and rendered as a visual trace tree in the dashboard.

Glassbrain gives you 1,000 traces per month on the free tier, with no credit card required, which is enough for most side projects and early stage products to run indefinitely. The SDK is open, the ingestion is async so your app never slows down, and replay is built in so you can rerun any captured trace against your updated code. Anthropic tracing and OpenAI tracing work identically, so you can mix providers in the same app without thinking about it.

Common LLM Tracing Mistakes to Avoid

Teams that are new to LLM tracing tend to make the same handful of mistakes. Knowing them in advance will save you a lot of time.

Not capturing the assembled prompt. The most common mistake is logging the prompt template and the variables separately, then trying to reconstruct what actually got sent. You will get it wrong. Always capture the final assembled prompt exactly as it left your process. This is non-negotiable for debugging.

Missing tool call traces. When your LLM calls a tool, that tool call and its result must live as their own spans in the trace. Teams often wrap the OpenAI call but forget to instrument the tools the model invokes, which leaves a giant hole in the middle of every agent trace. Wrap your tool executors too.
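Wrapping a tool executor can be as simple as a decorator that records a span around every invocation. This is a hedged sketch of the pattern, not a specific SDK's API:

```python
import functools

tool_spans = []  # stands in for the tracing SDK's span collector

def traced_tool(fn):
    """Decorator that records a span for every tool invocation."""
    @functools.wraps(fn)
    def wrapper(**kwargs):
        span = {"tool": fn.__name__, "arguments": kwargs}
        try:
            span["result"] = fn(**kwargs)
            return span["result"]
        except Exception as e:
            span["error"] = repr(e)  # failed tool calls are captured too
            raise
        finally:
            tool_spans.append(span)
    return wrapper

@traced_tool
def get_weather(city: str) -> str:
    """A hypothetical tool the model might invoke."""
    return f"Sunny in {city}"

get_weather(city="Oslo")
```

With the executor wrapped, the tool name, the arguments the model chose, and the result (or error) all land in the trace instead of leaving a hole in the middle of it.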

Not tracking model version. Providers silently roll out new model snapshots, and your bug might be caused by a provider-side change rather than your code. If your LLM trace does not record the exact model string returned by the API, you will have no way to correlate bugs with model updates. Capture it on every span.

Sampling too aggressively. In traditional APM, sampling one percent of traces is fine because you mostly care about aggregates. In LLM tracing, you often care about the weird individual request, the one where the model did something strange. If you sample too aggressively, you will lose exactly the traces you most want to debug. Start by capturing everything and only add sampling once you have real volume concerns.

Frequently Asked Questions

What is LLM tracing?

LLM tracing is the practice of recording every step of a large language model interaction (prompt, response, tokens, latency, model version, tool calls, errors) as a structured tree of spans, so you can see exactly what your AI application did and debug it visually instead of reading raw logs.

Is LLM tracing the same as logging?

No. Logging produces a flat stream of text lines that you have to grep through. LLM tracing produces a structured tree of spans, each with rich metadata (full prompt, full response, tokens, latency, model), that you can explore visually. Tracing tells you the causal story of a request. Logging just tells you that things happened.

How do I add tracing to OpenAI calls?

The easiest way to add OpenAI tracing is to use a wrapper function from a tracing SDK. With Glassbrain you call wrapOpenAI on your OpenAI client in JavaScript, or wrap_openai on it in Python, and every call you already make is traced automatically.

Does tracing slow down my app?

No, not in any meaningful way. Well-built LLM tracing SDKs capture spans in memory and export them asynchronously in batches, so your request path never waits on the tracing backend. If the backend is unreachable, the SDK retries and eventually drops the batch rather than blocking your code.

Can I trace Anthropic Claude calls?

Yes. Anthropic tracing works the same way as OpenAI tracing. With Glassbrain you wrap your Anthropic client using wrap_anthropic in Python, and every Claude call is captured with full prompt, response, tokens, model version, and latency. You can mix Anthropic and OpenAI in the same app and they show up together in the same trace tree.

What is the easiest LLM tracing tool?

For most developers, the easiest option is Glassbrain: one-line install, JS and Python SDKs, visual trace tree, replay built in, AI fix suggestions, and 1,000 free traces per month with no credit card. You can be up and tracing LLM calls in under sixty seconds.

Conclusion

LLM tracing is quickly becoming a required part of the AI development stack, and for good reason. Large language models are non-deterministic, context-sensitive, and opaque, which means the old debugging tools (print statements, raw logs, stack traces) simply do not give you enough information to understand what went wrong. A good LLM trace captures the full prompt, the full response, tokens, cost, latency, model version, tool calls, and errors for every step of your application, then renders them as a visual tree you can explore. That one change turns AI debugging from guesswork into engineering.

If you have not yet added tracing to your AI app, the barrier to entry has never been lower. Modern SDKs let you wrap your existing OpenAI or Anthropic client in a single line, and tools like Glassbrain give you a generous free tier so you can try it with zero commitment. The next time a user reports a weird answer or a workflow goes off the rails, you will open the trace, see exactly what happened, fix it, replay the trace to verify your fix, and ship. That is the workflow AI tracing unlocks, and once you have it, you will wonder how you ever built LLM apps without it.


Start tracing your LLM calls in 60 seconds.

Try Glassbrain Free