LLM Tracing: The Complete Guide for AI Developers
If you have ever stared at a chatbot response and wondered "why on earth did it say that?", you are not alone. LLM tracing is the practice of recording every step your AI application takes from user input to final output, giving you a structured, inspectable record of what happened and why. Unlike traditional logging, which captures flat text lines, LLM tracing captures a tree of interconnected nodes that represent the actual decision path your application followed.
This guide covers everything you need to know to instrument your AI applications with tracing, read traces to find bugs, and use that data to ship better products. Whether you are building a simple chatbot or a complex multi-agent RAG pipeline, tracing is the single most impactful debugging tool you can add.
How LLM Tracing Differs from Traditional Application Tracing
Traditional distributed tracing (think Jaeger, Zipkin, Datadog APM) was designed for request/response cycles across microservices. A span represents a unit of work: an HTTP call, a database query, a cache lookup. The mental model is linear and predictable.
LLM tracing operates in a fundamentally different problem space:
- Non-deterministic outputs. The same input can produce different outputs on every call. You need to capture the exact output alongside the input to reproduce issues.
- Token-level economics. Every LLM call has a cost measured in tokens. Traces need to capture token counts, model parameters, and latency to help you optimize spend.
- Multi-step reasoning. A single user query might trigger document retrieval, re-ranking, prompt assembly, multiple LLM calls, tool use, and post-processing. Each step can introduce errors that compound.
- Context windows matter. What goes into the prompt is often more important than the code that assembles it. Tracing must capture the full prompt text, not just metadata.
Traditional APM tools can tell you that your /api/chat endpoint took 3.2 seconds. LLM tracing tells you that 1.8 seconds was spent retrieving documents, the retriever returned 4 chunks but only 1 was relevant, the prompt used 3,200 tokens, and the model hallucinated because the relevant context was buried at position 3 in the context window.
The Anatomy of an LLM Trace
A well-structured LLM trace is a directed acyclic graph (DAG) of nodes. Here are the core node types you will encounter:
Input Node
Captures the raw user input and any session metadata. This is your trace root. It should include the user message, conversation history reference, and any client-side metadata like user ID or session ID.
Retrieval Node
Records what documents or chunks were fetched from your vector store or search index. Critical fields: query embedding, number of results, similarity scores, and the actual text content returned. This is where most RAG bugs originate.
Prompt Assembly Node
Shows the final prompt sent to the model. This includes the system message, retrieved context, conversation history, and user query stitched together. Capture the full text and the token count.
LLM Call Node
The model invocation itself. Record: model name, temperature, max tokens, stop sequences, the complete response, token usage (prompt tokens, completion tokens), latency, and any tool calls the model requested.
Output Node
The final response delivered to the user, after any post-processing, filtering, or formatting. Include any citations or sources that were attached.
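The trace structure described above can be modeled with a few lines of code. The sketch below is a minimal, hypothetical schema — the `TraceNode` class and its field names are illustrative, not from any particular tracing library:

```python
from dataclasses import dataclass, field

@dataclass
class TraceNode:
    """One node in the trace graph. Names here are illustrative only."""
    node_type: str                  # "input", "retrieval", "prompt_assembly", "llm_call", "output"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child: "TraceNode") -> "TraceNode":
        """Attach a downstream step and return it for chaining."""
        self.children.append(child)
        return child

# Build the trace for a single request: input -> retrieval -> llm_call
root = TraceNode("input", {"user_message": "What is your return policy?"})
retrieval = root.add_child(TraceNode("retrieval", {"num_results": 5}))
llm = root.add_child(TraceNode("llm_call", {"model": "gpt-4o", "prompt_tokens": 3200}))
```

However your tracing backend stores this, the key property is that every node keeps both its own attributes and its position in the pipeline, so you can walk from the final output back to the input that produced it.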
Adding LLM Tracing to Your AI App
Let us start with practical instrumentation. The following examples show how to wrap your existing LLM calls with tracing using a lightweight approach that works with any backend.
Tracing OpenAI SDK Calls
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function tracedCompletion(userMessage, traceContext) {
  const span = traceContext.startSpan("llm_call", {
    model: "gpt-4o",
    type: "chat_completion",
  });
  const startTime = Date.now();
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: userMessage },
      ],
      temperature: 0.7,
    });
    const latencyMs = Date.now() - startTime;
    const choice = response.choices[0];
    span.setAttributes({
      "llm.prompt_tokens": response.usage.prompt_tokens,
      "llm.completion_tokens": response.usage.completion_tokens,
      "llm.total_tokens": response.usage.total_tokens,
      "llm.latency_ms": latencyMs,
      "llm.finish_reason": choice.finish_reason,
      "llm.response_preview": choice.message.content.slice(0, 200),
    });
    span.end("success");
    return choice.message.content;
  } catch (error) {
    span.setAttributes({ "error.message": error.message });
    span.end("error");
    throw error;
  }
}
```
Tracing Anthropic SDK Calls
```python
import anthropic
import time

client = anthropic.Anthropic()

def traced_completion(user_message, trace_context):
    span = trace_context.start_span("llm_call", attributes={
        "model": "claude-sonnet-4-20250514",
        "type": "chat_completion",
    })
    start_time = time.time()
    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[
                {"role": "user", "content": user_message}
            ],
        )
        latency_ms = (time.time() - start_time) * 1000
        span.set_attributes({
            "llm.input_tokens": response.usage.input_tokens,
            "llm.output_tokens": response.usage.output_tokens,
            "llm.latency_ms": latency_ms,
            "llm.stop_reason": response.stop_reason,
        })
        span.end("success")
        return response.content[0].text
    except Exception as e:
        span.set_attributes({"error.message": str(e)})
        span.end("error")
        raise
```
OpenTelemetry for LLM Tracing
OpenTelemetry LLM tracing is becoming the industry standard for instrumentation. The OpenTelemetry project has been working on semantic conventions specifically for generative AI, which means you can use a vendor-neutral format to capture LLM-specific data.
How It Works
OpenTelemetry extends its existing span model with LLM-specific attributes. The gen_ai namespace defines standard attribute names like gen_ai.system, gen_ai.request.model, gen_ai.usage.prompt_tokens, and gen_ai.usage.completion_tokens. This means your traces are portable across any backend that supports OTLP.
```python
from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Set up the tracer with OTLP export
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="https://your-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-ai-app")

client = OpenAI()

def query_llm(prompt):
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        span.set_attribute("gen_ai.request.temperature", 0.7)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        span.set_attribute("gen_ai.usage.prompt_tokens", response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens", response.usage.completion_tokens)
        span.set_attribute("gen_ai.response.finish_reason", response.choices[0].finish_reason)
        return response.choices[0].message.content
```
Pros and Cons
Pros: Vendor-neutral, well-supported ecosystem, integrates with existing observability stacks, semantic conventions are maturing fast, and most LLM tracing platforms accept OTLP data natively.
Cons: The gen_ai semantic conventions are still evolving (some attributes may change), capturing full prompt/response text in span attributes can hit size limits, and the setup overhead is non-trivial for small projects. For teams already using OpenTelemetry, it is a natural fit. For greenfield projects, a purpose-built LLM tracing tool may get you to value faster.
Instrumenting a RAG Chatbot in Under 5 Minutes
Here is a complete example that traces a RAG pipeline end to end. This captures every node type we discussed earlier.
```javascript
import OpenAI from "openai";
import { createTrace } from "./tracing"; // your tracing lib
// `vectorStore` and `countTokens` below are placeholders for your own
// vector store client and tokenizer.

const openai = new OpenAI();

async function ragChat(userQuery) {
  const trace = createTrace("rag_chat", { user_query: userQuery });

  // 1. Retrieval
  const retrievalSpan = trace.span("retrieval");
  const chunks = await vectorStore.search(userQuery, { topK: 5 });
  retrievalSpan.log({
    num_results: chunks.length,
    scores: chunks.map((c) => c.score),
    previews: chunks.map((c) => c.text.slice(0, 100)),
  });
  retrievalSpan.end();

  // 2. Prompt Assembly
  const promptSpan = trace.span("prompt_assembly");
  const context = chunks.map((c) => c.text).join("\n\n---\n\n");
  const systemPrompt = `Answer based on the following context:\n\n${context}`;
  const messages = [
    { role: "system", content: systemPrompt },
    { role: "user", content: userQuery },
  ];
  promptSpan.log({
    system_prompt_tokens: countTokens(systemPrompt),
    total_messages: messages.length,
  });
  promptSpan.end();

  // 3. LLM Call
  const llmSpan = trace.span("llm_call");
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    temperature: 0.3,
  });
  llmSpan.log({
    model: "gpt-4o",
    prompt_tokens: response.usage.prompt_tokens,
    completion_tokens: response.usage.completion_tokens,
    finish_reason: response.choices[0].finish_reason,
  });
  llmSpan.end();

  // 4. Output
  const output = response.choices[0].message.content;
  trace.end({ output_preview: output.slice(0, 200) });
  return output;
}
```
That is fewer than 50 lines of instrumentation code, and it gives you full visibility into every stage of your pipeline. When something goes wrong, you will know exactly where to look.
Reading a Trace to Find Bugs: A Real Example
Let us walk through a concrete debugging scenario. You are building a customer support bot for an e-commerce company. A user asks: "What is your return policy for electronics?" The bot responds with the return policy for clothing.
Without tracing, you would start guessing. Maybe the prompt is wrong? Maybe the model is hallucinating? Maybe the embeddings are bad? With tracing, you open the trace and immediately see:
- Input Node: "What is your return policy for electronics?" - looks correct.
- Retrieval Node: 5 chunks returned. Chunk 1 (score: 0.89) is about clothing returns. Chunk 2 (score: 0.87) is about electronics returns. Chunks 3-5 are about shipping.
- Prompt Assembly Node: The context window has the clothing return policy appearing first, before the electronics policy.
- LLM Call Node: The model latched onto the first relevant-looking policy in the context and used it for the answer.
The bug is now obvious: your retriever is returning the right documents but in the wrong order. The clothing policy has a slightly higher similarity score than the electronics policy. The fix could be to add metadata filtering by product category before similarity search, or to re-rank results using a cross-encoder.
This is the kind of bug that would take hours to find with print statements and days to find by staring at embeddings. With a proper trace, you found it in 30 seconds.
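The re-ranking fix could be sketched as a small post-retrieval step. Everything here is illustrative: `rerank_chunks` is a hypothetical helper, chunks are assumed to carry a `category` metadata field, and `score_pair` stands in for a real cross-encoder scorer:

```python
def rerank_chunks(query, chunks, category=None, score_pair=None):
    """Filter by metadata category first, then optionally re-score.

    `chunks` are dicts like {"text": ..., "score": ..., "category": ...}.
    `score_pair(query, text)` is a placeholder for a cross-encoder scorer.
    """
    if category is not None:
        filtered = [c for c in chunks if c.get("category") == category]
        # Fall back to the unfiltered list if the filter removes everything
        chunks = filtered or chunks
    if score_pair is not None:
        chunks = sorted(chunks, key=lambda c: score_pair(query, c["text"]), reverse=True)
    return chunks

# The buggy trace: clothing (0.89) outranks electronics (0.87) on raw similarity
chunks = [
    {"text": "Clothing returns: 30 days...", "score": 0.89, "category": "clothing"},
    {"text": "Electronics returns: 14 days...", "score": 0.87, "category": "electronics"},
]
fixed = rerank_chunks("return policy for electronics", chunks, category="electronics")
```

With the category filter applied, the electronics policy lands first in the context window, which is exactly what the trace told you the model needed.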
From Traces to Fixes with Visual Trace Tools
Reading raw trace data in JSON or log files works, but it does not scale. When you are debugging a pipeline with 15 nodes across multiple LLM calls and tool invocations, you need a visual representation.
This is where purpose-built tools like Glassbrain change the workflow. Instead of parsing JSON, you see your trace as an interactive node graph. Click on the retrieval node and you see every document that was returned with its score. Click on the prompt node and you see the exact text that was sent to the model. Click on the LLM node and you see the response, token usage, and latency.
The real power comes from being able to modify and replay. Found a bad retrieval result? Swap it out and re-run the downstream nodes to see if the output improves. Suspect the temperature is too high? Adjust it in the trace view and test immediately. This tight feedback loop between identifying a problem in a trace and validating a fix is what turns tracing from a passive observability tool into an active debugging workflow. See the docs for integration guides with all major LLM frameworks.
Best Practices for LLM Tracing
What to Capture
- Full prompt text. You cannot debug what you cannot see. Always capture the complete prompt sent to the model.
- Full response text. Same reasoning. Truncated responses are useless for debugging.
- Token counts and latency. Essential for cost optimization and performance monitoring.
- Retrieval scores and content. The most common source of RAG bugs. Capture similarity scores and the actual text of retrieved chunks.
- Model parameters. Temperature, top_p, max_tokens, model version. These affect output quality and reproducibility.
- Error states and retries. Capture rate limits, timeouts, and content filter triggers.
What Not to Capture
- PII in production traces. Scrub or hash user identifiers, email addresses, and other personal data before storing traces. Build a sanitization layer into your tracing pipeline.
- Embeddings vectors. They are large, opaque, and rarely useful for debugging. Capture the text that was embedded, not the vector itself.
- Intermediate framework internals. LangChain and LlamaIndex emit dozens of internal events. Capture the ones that map to your logical pipeline steps and ignore the rest.
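A sanitization layer can start as a simple regex pass over trace payloads before export. The sketch below handles only email addresses and is a starting point, not an exhaustive PII scrubber; hashing instead of deleting keeps traces correlatable per user without storing the raw value:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(text: str) -> str:
    """Replace each email address with a short, stable hash token."""
    return EMAIL_RE.sub(
        lambda m: "email:" + hashlib.sha256(m.group().encode()).hexdigest()[:8],
        text,
    )

def sanitize_attributes(attrs: dict) -> dict:
    """Scrub every string value in a span's attribute dict before export."""
    return {k: scrub(v) if isinstance(v, str) else v for k, v in attrs.items()}
```

Run `sanitize_attributes` as the last step before spans leave your process, so unsanitized data never reaches the tracing backend.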
Managing Performance Overhead
Tracing adds latency. Here is how to keep it minimal:
- Async export. Never block the request path on trace export. Buffer spans and flush them in the background.
- Sample in production. You do not need to trace 100% of requests. A 10-20% sample rate catches most issues while keeping overhead negligible.
- Use structured data, not string concatenation. Building trace payloads with string formatting is slow and memory-intensive. Use dictionaries and let the serializer handle it.
- Set retention policies. Traces are large. Keep detailed traces for 7-14 days and aggregate metrics for longer. Most bugs surface within hours of deployment, not weeks.
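The first two points — async export and sampling — can be combined in a small exporter sketch. The `SampledExporter` class below is hypothetical, using only the standard library; in practice you would swap the `flush` callable for your backend's export call:

```python
import queue
import random
import threading

class SampledExporter:
    """Buffer spans off the request path; decide sampling once per trace."""

    def __init__(self, sample_rate=0.1, flush=print):
        self.sample_rate = sample_rate
        self.buffer = queue.Queue()
        self.flush = flush  # replace with your backend's export call
        threading.Thread(target=self._drain, daemon=True).start()

    def should_sample(self) -> bool:
        # Decide at trace start so a trace is never half-exported
        return random.random() < self.sample_rate

    def export(self, span: dict) -> None:
        # Non-blocking from the caller's perspective
        self.buffer.put(span)

    def _drain(self) -> None:
        # Background thread does the slow network work
        while True:
            span = self.buffer.get()
            self.flush(span)
            self.buffer.task_done()
```

The request handler calls `should_sample()` once when a trace begins, then `export()` per span; neither call touches the network, so tracing overhead stays in the microsecond range.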
The worst choice is no tracing at all. Every week you run an LLM application without tracing is a week where bugs hide in the gap between "the code looks right" and "the output is wrong."
Start debugging your AI apps visually.
Try Glassbrain Free