
How to Trace and Monitor Every LLM Request and Response in Your App

What to capture on every LLM call, three capture approaches compared, and how to turn raw logs into structured traces for monitoring and debugging.

LLM monitoring, request logging, tracing, observability

Every LLM Call Is a Black Box Unless You Capture It

Large language models are probabilistic systems. The same prompt can produce different outputs depending on model version, temperature settings, system load, and dozens of other variables you never directly control. When you ship an application that relies on LLM calls, you are building on a foundation that shifts with every request. If you are not capturing every request and response, you have no way to know what your application actually said to users, why a particular output was wrong, or whether performance is degrading over time.

The problem compounds quickly. A single user interaction might trigger multiple LLM calls: one for classification, one for retrieval-augmented generation, one for summarization, and one for formatting. If any of those calls returns unexpected output, the downstream calls inherit that error and amplify it. Without a complete trace of what happened at each step, debugging becomes guesswork. You end up reading through application logs, trying to reconstruct the sequence of events manually, and hoping you can reproduce the problem in a test environment. That approach does not scale.

Traditional application monitoring does not solve this. APM tools track HTTP status codes and response times, but they treat the LLM call as an opaque function. They cannot tell you that the model refused a valid request, hallucinated a fact, or consumed three times the expected tokens. You need purpose-built tracing that understands the structure of LLM interactions and captures the full context of every call.

This guide covers what to capture on every LLM request, the practical methods for capturing it, how to turn raw logs into structured traces, what to monitor once the data is flowing, and how to set up complete capture in minutes with Glassbrain.

What to Capture on Every LLM Request

Before choosing a tool or building a pipeline, you need to know exactly what data points matter. Missing even one of these fields can leave you blind when debugging a production issue. The following seven categories cover everything you need to trace and monitor every LLM request and response effectively.

Full Prompt (Including System Messages)

Capture the complete prompt sent to the model, including system messages, user messages, assistant prefills, and any few-shot examples. This means logging the exact payload, not a truncated summary. When an output is wrong, the first question is always "what did we actually send?" Store prompts in their structured form (the array of message objects) rather than as a single concatenated string. Structured storage lets you search by role, filter by system prompt version, and compare prompts side by side. If you are using retrieval-augmented generation, include the retrieved context that was injected into the prompt. Without the full input, you cannot determine whether a bad output was caused by a bad prompt, bad context, or a model issue.
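As a concrete sketch of structured prompt storage, the helper below builds a queryable log record from an OpenAI-style message array. The function name and field names are illustrative, not a required schema; the point is that role, system prompt, and retrieved context stay individually searchable instead of being flattened into one string.

```python
import json
from datetime import datetime, timezone

def build_prompt_record(messages, retrieved_context=None):
    """Build a structured, queryable log record from an OpenAI-style
    message array. Field names here are illustrative, not a required
    schema."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "messages": messages,                      # the full structured payload
        "roles": [m["role"] for m in messages],    # enables filtering by role
        "system_prompt": next(
            (m["content"] for m in messages if m["role"] == "system"), None
        ),
        "retrieved_context": retrieved_context,    # RAG context, if any
    }
    return json.dumps(record)

record = json.loads(build_prompt_record(
    [{"role": "system", "content": "You are a support agent."},
     {"role": "user", "content": "Where is my order?"}],
    retrieved_context=["Order #123 shipped on Monday."],
))
```

Because the record is JSON with stable field names, "find all traces using system prompt version X" becomes a query rather than a grep.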

Full Response (Including Metadata)

Log the entire response object, not just the text content. This includes finish reason, which tells you whether the model stopped naturally, hit a token limit, or was filtered by a content policy. For streaming responses, reconstruct and store the complete output after the stream finishes. Record the response ID if the provider returns one, as this is essential for filing support tickets or investigating provider-side issues. If the model returns structured output (JSON mode, tool calls), validate and log the parsed structure alongside the raw text. The response metadata often contains information that is invisible in the text output but critical for debugging.
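The snippet below pulls the debugging-critical fields out of an OpenAI-style chat completion response. A plain dict stands in for the response object here so the example is self-contained; with the real SDK you would read the same attributes off the returned object.

```python
def extract_response_metadata(resp: dict) -> dict:
    """Extract the fields that matter for debugging from an
    OpenAI-style chat completion response (shown as a plain dict)."""
    choice = resp["choices"][0]
    return {
        "response_id": resp.get("id"),           # needed for provider support tickets
        "model": resp.get("model"),              # resolved model, not the requested alias
        "finish_reason": choice.get("finish_reason"),  # stop / length / content_filter
        "text": choice["message"].get("content"),
    }

meta = extract_response_metadata({
    "id": "chatcmpl-abc123",
    "model": "gpt-4o-2024-08-06",
    "choices": [{"message": {"content": "Hello"}, "finish_reason": "length"}],
})
```

Note that this one record also captures the resolved model version, which matters for the regression investigations discussed below.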

Token Counts (Input, Output, Total)

Record prompt tokens, completion tokens, and total tokens for every call. Token counts are your primary cost signal. Without them, you cannot attribute spend to specific features, users, or prompt versions. They also serve as an early warning system: if a prompt's token count suddenly spikes, something upstream changed. Track token counts over time to detect prompt drift, where gradual changes to prompt templates or retrieved context cause token usage to creep upward without anyone noticing. Many teams discover that a single poorly optimized prompt accounts for a disproportionate share of their LLM spend. You cannot find these problems without per-call token data.
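A minimal sketch of drift detection over per-call token counts: compare each call against a rolling baseline and flag spikes. A production system would track this per prompt template rather than globally; the class and threshold here are illustrative.

```python
from collections import deque

class TokenDriftDetector:
    """Flag calls whose prompt token count exceeds a rolling baseline
    by more than `threshold` (1.5 = 50% over baseline). A minimal
    sketch; real systems would track this per prompt template."""

    def __init__(self, window: int = 100, threshold: float = 1.5):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, prompt_tokens: int) -> bool:
        """Record a token count; return True if it spikes above baseline."""
        spiked = False
        if self.history:
            baseline = sum(self.history) / len(self.history)
            spiked = prompt_tokens > baseline * self.threshold
        self.history.append(prompt_tokens)
        return spiked

detector = TokenDriftDetector()
for _ in range(10):
    detector.observe(1000)          # establish a ~1000-token baseline
alert = detector.observe(1800)      # 80% over baseline -> flagged
```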

Latency (Wall Clock and Time to First Token)

Measure wall-clock time from request start to response completion, including time-to-first-token (TTFT) for streaming calls. Latency directly impacts user experience, and LLM latency is highly variable. A call that takes 800 milliseconds on average might spike to 5 seconds during peak provider load. TTFT is especially important for chat interfaces, where users perceive responsiveness based on when the first characters appear. Track both metrics separately. A request with fast TTFT but slow total time might indicate a long response, while slow TTFT with fast total time suggests provider queuing delays. Breaking latency into these components helps you diagnose whether performance issues originate in your application, the network, or the model provider.
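Both metrics can be captured in one pass over a streaming response. In this sketch, a generator with sleeps stands in for the provider's stream iterator; with a real SDK you would iterate the streaming response the same way.

```python
import time

def consume_stream(chunks):
    """Measure time-to-first-token and total wall-clock time while
    consuming a streaming response. `chunks` stands in for the
    provider's stream iterator."""
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.monotonic() - start   # first token arrived
        parts.append(chunk)
    total = time.monotonic() - start
    return "".join(parts), ttft, total

def fake_stream():
    time.sleep(0.05)   # simulated provider queuing delay before first token
    yield "Hel"
    time.sleep(0.02)
    yield "lo"

text, ttft, total = consume_stream(fake_stream())
```

Logging `ttft` and `total` as separate span fields is what lets you later distinguish "long response" from "provider queuing delay."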

Model Version (Resolved, Not Requested)

Always log the exact model identifier returned by the API, not just the model you requested. Providers frequently update models behind version aliases. When you request "gpt-4o," the actual model served might change without notice. Logging the resolved model version lets you correlate quality changes with model updates. This data is invaluable during regression investigations. If output quality dropped on Tuesday, and you can see that the resolved model version changed on Tuesday, you have your answer. Without this field, you would spend hours testing prompt changes that were never the problem.

Tool Calls (Name, Arguments, Results)

If your application uses function calling or tool use, capture every tool invocation: the tool name, the arguments the model generated, the result returned, and any errors that occurred during execution. Tool calls are a common source of failures in agentic applications. The model might generate syntactically valid but semantically wrong arguments, call the wrong tool, or enter infinite loops of tool invocations. Without logging the full tool call chain, these issues are nearly impossible to diagnose. Record the sequence of tool calls within a single turn, as the order often matters. Also log cases where the model chose not to call a tool when it should have, as these "silent failures" are harder to detect but equally damaging to user experience.
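One way to preserve the ordered tool-call chain is to log each invocation with its position in the turn. The record shape below is illustrative; the essential part is keeping name, arguments, result, and execution error together, in order.

```python
def log_tool_calls(turn_id, tool_calls):
    """Record the ordered chain of tool invocations within one model
    turn. Field names are illustrative; the key point is preserving
    order and capturing arguments, results, and errors together."""
    records = []
    for position, call in enumerate(tool_calls):
        records.append({
            "turn_id": turn_id,
            "position": position,          # order within the turn matters
            "tool": call["name"],
            "arguments": call["arguments"],
            "result": call.get("result"),
            "error": call.get("error"),    # execution failures, not just model errors
        })
    return records

chain = log_tool_calls("turn-42", [
    {"name": "search_orders", "arguments": {"user_id": 7}, "result": ["#123"]},
    {"name": "get_shipping", "arguments": {"order": "#123"}, "error": "timeout"},
])
```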

Errors, Refusals, and Rate Limits

Log all error responses, including rate limits (HTTP 429), timeouts, server errors (HTTP 500 and 503), content filter triggers, and model refusals. A spike in rate limit errors might indicate thundering herd behavior in your application. A pattern of content filter triggers might reveal users pushing your application in unexpected directions. Record the full error response body, not just the status code, as providers often include diagnostic information in error messages. Track retry behavior as well: how many retries each request required before succeeding or permanently failing. This data helps you tune retry logic and identify calls that consistently operate near the edge of reliability. Refusals deserve special attention. When a model refuses a request, it is not an error in the traditional sense, but it is a failure from the user's perspective. Tracking refusal patterns helps you identify prompt weaknesses and content policy boundaries.
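Attempt counts are easy to capture if the retry wrapper returns them alongside the result. This is a minimal sketch with exponential backoff; real retry logic would also inspect status codes and honor Retry-After headers.

```python
import time

def call_with_retries(fn, max_retries=3, base_delay=0.01):
    """Call `fn`, retrying on exception with exponential backoff.
    Returns (result, attempts) so the attempt count can be logged
    on the trace. A minimal sketch."""
    for attempt in range(1, max_retries + 2):
        try:
            return fn(), attempt
        except Exception:
            if attempt > max_retries:
                raise
            time.sleep(base_delay * (2 ** (attempt - 1)))  # backoff before retrying

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limited")   # simulated transient error
    return "ok"

result, attempts = call_with_retries(flaky)
```

A call that routinely succeeds only on attempt 3 is operating near the edge of reliability, even though it never shows up in error-rate metrics.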

Three Ways to Capture LLM Requests

Once you know what to capture, the question is how. There are three common approaches, each with distinct trade-offs in coverage, complexity, and maintenance burden.

Manual Logging

The simplest approach is adding logging statements around every LLM call in your codebase. Before the call, log the prompt and parameters. After the call, log the response, tokens, and latency. This works for applications with a handful of call sites, but breaks down quickly as your codebase grows. Every new LLM call needs its own logging code. Developers forget to add it. The log format drifts between files as different team members implement logging differently. Refactors break logging without anyone noticing until a production incident forces someone to check. Manual logging also cannot capture calls made inside third-party libraries or frameworks. If you use a library that makes LLM calls internally, those calls are invisible to your logging.

Proxy-Based Capture

A proxy sits between your application and the LLM provider, intercepting all traffic. This approach guarantees complete capture because every request must pass through the proxy regardless of where it originates in your code. However, it adds a network hop to every call, increasing latency by several milliseconds at minimum and potentially more under load. You need to run and maintain the proxy infrastructure, handle TLS termination, manage failover, and accept the operational complexity. Proxy-based capture can also create a single point of failure: if the proxy goes down, all LLM calls fail. For teams with strong infrastructure skills and high compliance requirements, proxies can make sense. For most teams, the operational overhead outweighs the benefits.

SDK Wrapping

SDK wrapping intercepts calls at the client library level. Instead of modifying every call site or routing traffic through a proxy, you wrap the LLM client once at initialization. All subsequent calls through that client are automatically captured with full request and response data, token counts, latency measurements, and error details. This combines the completeness of the proxy approach with the simplicity of in-process logging and adds zero network latency. The wrapper runs in the same process as your application and captures data before and after each call without any external infrastructure. Tools like Glassbrain use this pattern, offering one-line wrappers such as wrapOpenAI for JavaScript and wrap_openai or wrap_anthropic for Python. SDK wrapping is the approach most teams should start with because it delivers complete capture with minimal setup and no infrastructure to maintain.
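The general shape of the pattern looks like this. This is an illustrative sketch, not Glassbrain's actual implementation: a fake provider stands in for the real client, and a list stands in for the trace sink.

```python
import time

class TracedClient:
    """Sketch of the SDK-wrapping pattern: wrap a client's `complete`
    method once, and every subsequent call is captured with inputs,
    outputs, errors, and latency. Illustrative only."""

    def __init__(self, client, sink):
        self._client = client
        self._sink = sink          # wherever trace records go

    def complete(self, **kwargs):
        start = time.monotonic()
        error = None
        try:
            response = self._client.complete(**kwargs)
            return response
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            self._sink.append({
                "request": kwargs,
                "response": response if error is None else None,
                "error": error,
                "latency_s": time.monotonic() - start,
            })

class FakeProvider:
    def complete(self, **kwargs):
        return {"text": "hi", "usage": {"total_tokens": 5}}

sink = []
client = TracedClient(FakeProvider(), sink)
out = client.complete(model="demo", messages=[{"role": "user", "content": "hey"}])
```

Call sites are unchanged; capture happens once, at the wrapping point, in-process.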

Raw Logs vs. Structured Traces

Logging individual LLM calls is necessary but not sufficient. The distinction between raw logs and structured traces is the difference between a pile of receipts and a financial ledger. Both contain the same underlying data, but only one lets you understand what actually happened.

Raw logs are flat records: timestamp, request, response, tokens, latency. They answer the question "what happened on this specific call?" but they cannot answer "what happened during this user's request?" because they lack the relationships between calls. In any real application, a single user action triggers a chain of LLM calls. A customer support chatbot might classify the user's intent, retrieve relevant documentation, generate a response, check the response for accuracy, and format it for display. Five LLM calls, all serving one user request.

A structured trace represents one end-to-end operation. Within that trace, each LLM call is a span. Spans can be nested: a parent span for the overall request contains child spans for retrieval, generation, and validation. This parent-child structure turns a flat list of log entries into a tree that mirrors the actual execution flow. Each span carries its own timing data, so you can see not only that a pipeline took 4 seconds total, but that 3.2 seconds were spent in the generation span and 0.5 seconds in validation.

Consider a RAG pipeline. The user sends a question. Your application embeds the query (span 1), searches a vector database (span 2), constructs a prompt with retrieved context (span 3), calls the LLM (span 4), and checks for hallucinations (span 5). A flat log shows five separate, disconnected events. A trace tree shows that all five are children of a single parent trace, with timing relationships that reveal the bottleneck immediately.
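The RAG pipeline above can be sketched as a span tree. The `Span` dataclass and its fields are illustrative, not a specific trace schema; the durations mirror the 4-second example earlier in this section.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Minimal span node: each pipeline step is a span, and child
    spans nest under their parent to mirror execution flow."""
    name: str
    duration_s: float
    children: list = field(default_factory=list)

    def slowest_child(self):
        return max(self.children, key=lambda s: s.duration_s, default=None)

# The five-step RAG pipeline as one trace tree
trace = Span("handle_question", 4.0, children=[
    Span("embed_query", 0.1),
    Span("vector_search", 0.2),
    Span("build_prompt", 0.05),
    Span("llm_generate", 3.2),
    Span("check_hallucination", 0.45),
])
bottleneck = trace.slowest_child()
```

With the tree in hand, finding the bottleneck is a one-line traversal rather than a manual correlation of five log entries.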

Visual trace viewers, like the one built into Glassbrain, render these trees as interactive timelines where you can expand each span to see its full prompt, response, token counts, and latency. This turns hours of log-file archaeology into seconds of visual inspection. When a user reports a bad response, you find their trace, open the tree, and see exactly what happened at every step.

Monitoring Metrics That Matter

Once traces are flowing, you need to decide what to monitor and what thresholds justify an alert. The following table covers the core metrics every team should track when monitoring LLM requests and responses in production.

| Metric | Why It Matters | Alert Threshold Example |
| --- | --- | --- |
| P50 / P95 / P99 Latency | Detects provider slowdowns and prompt bloat | P95 exceeds 3 seconds |
| Error Rate | Catches rate limits, timeouts, API outages | Error rate exceeds 2% over 5 minutes |
| Token Usage per Request | Controls cost, identifies prompt inefficiency | Average tokens exceed baseline by 50% |
| Cost per User / Feature | Enables budgeting and abuse detection | Single user exceeds daily threshold |
| Refusal Rate | Reveals content policy triggers and prompt gaps | Refusal rate exceeds 1% |
| Tool Call Success Rate | Tracks function calling reliability in agents | Failure rate exceeds 5% |
| Model Version Distribution | Detects silent model updates by the provider | New model version appears in traffic |
| Time to First Token (TTFT) | Measures perceived responsiveness for streaming | TTFT P95 exceeds 1.5 seconds |
| Trace Completion Rate | Ensures your tracing pipeline itself is healthy | Completion rate drops below 99% |
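A percentile-based alert check is a few lines. This sketch uses a nearest-rank percentile, which is adequate for threshold checks like the P95 latency example above.

```python
def percentile(values, pct):
    """Nearest-rank percentile; good enough for alert checks."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies = [0.8] * 90 + [3.5] * 10   # 10% of calls are slow
p95 = percentile(latencies, 95)
alert = p95 > 3.0                     # the P95 threshold from the table
```

Note how the P50 of this distribution (0.8 s) looks perfectly healthy; only the tail percentile catches the slow 10% of calls.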

Beyond operational metrics, track quality signals. If your application includes user feedback mechanisms (thumbs up/down, ratings, corrections), correlate that feedback with trace data to identify which prompt versions and model versions produce the best outcomes. Quality metrics are harder to automate but far more valuable for long-term product improvement. Consider building a feedback loop where low-rated responses are automatically flagged for review, with the full trace attached for context.

Setting Up Complete Capture in 5 Minutes

Glassbrain provides a hosted tracing platform with one-line SDK integration that lets you trace and monitor every LLM request and response without building infrastructure.

Step 1: Install the SDK. For JavaScript, run npm install glassbrain-js. For Python, run pip install glassbrain. Both SDKs support all major LLM providers.

Step 2: Wrap your client. In JavaScript, import wrapOpenAI from the Glassbrain SDK and wrap your OpenAI client instance. In Python, use wrap_openai for OpenAI or wrap_anthropic for Anthropic. The wrapper is a single function call that returns an instrumented client. Your existing code continues to work exactly as before.

Step 3: Make calls as usual. Every LLM call through the wrapped client is automatically traced with full request payload, complete response, token counts, latency, model version, tool calls, and errors. No additional code changes are needed at individual call sites.

Step 4: View traces in the dashboard. Open the Glassbrain dashboard to see a visual trace tree for every request. Expand any span to inspect its full prompt and response. Use filters to find traces by model, status, latency, or custom metadata.

Step 5: Replay and debug. Built-in replay lets you re-run any traced request without needing your own API keys. AI-powered fix suggestions analyze failed traces and propose corrections. The free tier includes 1,000 traces per month with no credit card required, so you can start capturing traces today with zero commitment.

The entire setup takes less than five minutes. There is no self-hosting, no infrastructure to manage, and no configuration files to maintain. The SDK handles batching, retries, and asynchronous transmission so that tracing never blocks your application's hot path.

Common Mistakes When Logging LLM Requests

Teams that are new to LLM observability tend to make the same mistakes. Avoiding these pitfalls from the start will save you significant debugging time later.

Truncating prompts or responses. Storage is cheap. Debugging time is expensive. Some teams truncate long prompts or responses to save space, then discover during an incident that the critical information was in the truncated portion. Never discard data. If storage costs are a concern, compress the data or move it to cold storage after a retention period, but keep the full content available.

Logging only successful calls. Error responses contain critical diagnostic information. A 429 tells you about capacity. A timeout tells you about infrastructure. A content filter trigger tells you about user behavior. If you only log successes, you are blind to the worst user experiences, which are exactly the ones you need to understand and fix.

Using unstructured log formats. Writing LLM data to plain text log files makes querying nearly impossible at scale. Use structured formats (JSON) with consistent field names across all call sites. Enforce a schema for your log records so that every trace has the same queryable fields.

Ignoring trace correlation. Logging individual calls without linking them to a parent trace ID means you cannot reconstruct multi-step pipelines. Always propagate a trace ID through your call chain. This is the single most important architectural decision for LLM observability.
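In Python, `contextvars` is one way to propagate a trace ID without threading it through every function signature. The function names here are illustrative.

```python
import contextvars
import uuid

# Ambient trace ID, visible to every function in the same logical call chain
current_trace_id = contextvars.ContextVar("trace_id", default=None)

def start_trace():
    trace_id = uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id

def log_llm_call(step: str) -> dict:
    """Any call site deep in the pipeline can attach the ambient
    trace ID without receiving it as a parameter."""
    return {"trace_id": current_trace_id.get(), "step": step}

tid = start_trace()
records = [log_llm_call("classify"), log_llm_call("generate")]
```

Because `contextvars` is async-aware, the same pattern works across `await` boundaries, where thread-locals would leak between requests.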

Forgetting about PII. LLM prompts often contain user data: names, email addresses, account numbers, and personal questions. Make sure your logging pipeline respects your data retention policies and privacy requirements. Plan for PII handling from day one, not as an afterthought. Consider implementing automatic redaction for sensitive fields or ensuring that your trace storage meets the same compliance standards as your primary database.

Not monitoring the monitoring. If your tracing pipeline silently fails, you lose visibility exactly when you need it most. Add health checks to your trace ingestion endpoint. Track the trace completion rate (the percentage of LLM calls that successfully produce a trace record). Alert if this rate drops below 99%.

Sampling instead of capturing everything. Some teams sample traces (capturing only 10% or 1% of requests) to reduce costs. This works for high-volume web traffic, but LLM calls are expensive, relatively low-volume, and individually important. A single bad LLM response can cause a user to lose trust in your product. Capture everything. If your trace volume is so high that storage becomes a concern, you probably have a cost optimization problem with your LLM usage itself.

Frequently Asked Questions

Does logging every LLM request add latency to my application?

SDK-based tracing that sends data asynchronously adds negligible latency, typically under 1 millisecond per call. The trace data is buffered in memory and transmitted in batches on a background thread, so it never blocks the main request path. The LLM API call itself (which takes hundreds of milliseconds to several seconds) is always the bottleneck. Proxy-based approaches add more latency due to the extra network hop, typically 5 to 20 milliseconds depending on proxy location and load.
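The non-blocking pattern can be sketched with a queue and a background thread: the request path only enqueues, and a drain thread batches records. A real exporter would ship batches over HTTP with retries; the in-memory `sent` list stands in for that transport.

```python
import queue
import threading

class TraceBuffer:
    """Sketch of non-blocking trace export: the hot path only does a
    queue put; a background thread drains and batches. Illustrative."""

    def __init__(self, batch_size=10):
        self.q = queue.Queue()
        self.sent = []            # stands in for the network transport
        self.batch_size = batch_size
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, trace):
        """Called on the request path; costs roughly a queue put."""
        self.q.put(trace)

    def flush(self):
        """Block until the background thread has processed everything."""
        self.q.join()

    def _drain(self):
        batch = []
        while True:
            batch.append(self.q.get())
            if len(batch) >= self.batch_size:
                self.sent.append(list(batch))   # "send" a full batch
                batch.clear()
            self.q.task_done()

buf = TraceBuffer(batch_size=2)
buf.record({"id": 1})
buf.record({"id": 2})
buf.flush()
```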

How much storage do LLM traces require?

A typical LLM trace ranges from 2 KB to 50 KB depending on prompt and response length. At 10,000 requests per day, that works out to roughly 100 MB to 500 MB per day. At 100,000 requests per day, expect 1 GB to 5 GB daily. Hosted platforms like Glassbrain handle storage and retention automatically, so you do not need to provision or manage storage infrastructure. For self-hosted solutions, plan for 30 to 90 days of hot storage with older data moved to cold storage.

Should I log LLM requests in development or only in production?

Both environments benefit from tracing, but for different reasons. Production logging reveals how your application behaves with real user inputs at real scale, surfacing edge cases that synthetic test data never triggers. Development logging helps you iterate on prompts, compare model outputs, and catch regressions before they ship. Many of the most important issues (prompt injection attempts, unexpected model behavior with certain input patterns, token usage spikes from edge-case inputs) only appear in production traffic. Start with production, then extend to development and staging.

Can I trace requests across multiple LLM providers in the same application?

Yes. Modern tracing platforms support multi-provider tracing. Glassbrain provides wrappers for both OpenAI (wrapOpenAI, wrap_openai) and Anthropic (wrap_anthropic), and traces from different providers appear in the same trace tree. This is essential for applications that use multiple models: for example, using a fast model for classification and a more capable model for generation. The trace tree shows the full chain regardless of which provider handled each step, giving you a unified view of cross-provider pipelines.

What is the difference between LLM logging and LLM observability?

Logging is recording the data. Observability is the ability to understand your system's internal state by examining its outputs. Logging gives you the raw material: request payloads, response bodies, token counts, error codes. Observability adds structure (traces that link related calls), analysis (dashboards, alerts, trend detection), and insight (quality metrics, cost attribution, performance baselines). You cannot have observability without logging, but logging alone is not observability. The gap between the two is where most teams struggle, and it is exactly the gap that structured tracing platforms fill.


Capture every LLM request automatically.

Try Glassbrain Free