LlamaIndex Observability: Complete Setup Guide for 2026
Add observability to your LlamaIndex RAG pipeline. What to instrument, observability options compared, and a step-by-step Glassbrain setup guide.
LlamaIndex observability is no longer a nice-to-have for teams running retrieval-augmented generation in production. It is the difference between shipping a reliable RAG pipeline and shipping a black box that silently degrades over time. LlamaIndex has become one of the most popular frameworks for building RAG systems, agent workflows, and structured data extraction pipelines, but the framework's power comes from composition. A single query engine call can trigger retrieval from a vector store, embedding generation, query rewriting, sub-query decomposition, multiple LLM calls, response synthesis, and tool invocations. When any one of those stages misbehaves, the final answer looks wrong, but the root cause hides inside a chain of operations that you never see unless you have proper LlamaIndex tracing in place.
This guide walks through everything you need to know about LlamaIndex observability in 2026. We will cover why RAG pipelines demand more visibility than single LLM calls, what to instrument at each layer of a LlamaIndex app, the main observability options available, and a step-by-step setup for Glassbrain, the simplest option for teams that just want traces to work without rewriting their agent code. We will also look at the trace tree you get and common LlamaIndex debugging patterns, and answer the questions we hear most often from teams running LlamaIndex in production.
If you are building anything beyond a toy RAG demo, you need LlamaIndex observability from day one. The cost of adding it later, after you have a production incident and a confused user, is always higher than the cost of wiring it in now. And with modern SDKs that wrap the underlying OpenAI or Anthropic clients, setup takes one line and works with any LlamaIndex version.
Why LlamaIndex RAG Needs Observability
A naive LLM call has two places where things can go wrong: the prompt and the response. A LlamaIndex RAG pipeline has roughly ten. Each stage can fail silently, and each failure mode looks different from the outside. Without LlamaIndex observability, you are guessing at which stage broke.
Start with retrieval quality. When a user asks a question, LlamaIndex queries a vector store to find relevant documents. The quality of this retrieval determines the ceiling of your answer quality. If retrieval returns irrelevant chunks, no amount of prompt engineering on the synthesis step will produce a correct answer. But retrieval failures are invisible from the user side. The model dutifully generates a plausible-sounding response based on whatever it was given, and you never know the retriever missed the one paragraph that contained the actual answer. LlamaIndex tracing lets you see exactly which document chunks came back for each query, with their similarity scores, so you can spot retrieval problems the moment they happen.
Then there are embedding costs. Every query in a LlamaIndex pipeline typically generates an embedding for the question, and every document ingestion generates embeddings for each chunk. If you are running a production pipeline with thousands of queries per day and tens of thousands of documents, embedding costs can silently eat your budget. Without tracing, you discover this at the end of the month when your OpenAI bill arrives. With proper LlamaIndex observability, you see embedding call counts and token usage in real time.
LLM calls inside a query engine are the next trouble spot. LlamaIndex often makes multiple LLM calls per query. One for query rewriting, one or more for sub-query generation, one per sub-query answer, and one for final synthesis. Each call has its own latency, token count, and failure mode. A single slow LLM call in the middle of the chain can make your p99 latency look terrible while your p50 looks fine.
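To see why one slow call in the chain hurts p99 while leaving p50 untouched, here is a tiny nearest-rank percentile illustration. The latency numbers are made up for the example:

```python
import math

# 100 queries: 98 fast responses, two pathological outliers from one slow
# LLM call buried in the middle of the chain.
latencies = sorted([120] * 98 + [9000] * 2)

def percentile(values, p):
    """Nearest-rank percentile: value at rank ceil(p/100 * n), 1-indexed."""
    rank = math.ceil(p / 100 * len(values))
    return values[rank - 1]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(p50, p99)  # 120 9000 -- the median looks fine, the tail does not
```

Dashboards that only show averages hide this entirely; per-span latency in a trace is what points you at the one slow call.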
Query decomposition is specific to LlamaIndex and particularly hard to debug without tracing. When a user asks a complex question, LlamaIndex can break it into sub-queries, answer each one separately, and combine the results. If the decomposition step misunderstands the question, every downstream sub-query inherits the error. LlamaIndex debugging without tracing means reading logs and trying to reconstruct the decomposition by hand.
Finally, response synthesis. The synthesizer takes retrieved context and sub-query answers and produces the final response. Bugs here look like hallucinations, but they are often synthesis prompt issues. Tracing the synthesizer call with its full input context reveals whether the model saw the right information and chose to ignore it, or never had the information in the first place.
What to Instrument in a LlamaIndex App
Good LlamaIndex observability instruments every stage of the pipeline. Skipping any one stage leaves a blind spot that will eventually bite you. Here is what to cover.
Retrieval step
Every call to a retriever should be traced with the query string, the top-k setting, the documents returned, and the similarity scores. This is the single most valuable signal in a LlamaIndex RAG pipeline. When answers are wrong, the retrieval trace tells you whether the problem is upstream of the LLM or inside the LLM.
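As an illustration of what such a span should carry, here is a minimal pure-Python sketch. The span layout and the fake retriever are hypothetical, not any tool's real schema:

```python
import time

def trace_retrieval(retrieve_fn, query, top_k):
    """Run a retrieval call and record the fields worth keeping in a span.

    retrieve_fn is any callable returning (chunk_text, similarity_score)
    pairs; the dict layout here is illustrative only."""
    start = time.perf_counter()
    results = retrieve_fn(query, top_k)
    return {
        "kind": "retrieval",
        "query": query,
        "top_k": top_k,
        "latency_ms": (time.perf_counter() - start) * 1000,
        "chunks": [text for text, _ in results],
        "scores": [score for _, score in results],
    }

def fake_retriever(query, top_k):
    """Toy stand-in for a vector store lookup."""
    corpus = [("refund policy: 30 days", 0.91), ("shipping times", 0.42)]
    return corpus[:top_k]

span = trace_retrieval(fake_retriever, "What is the refund window?", top_k=2)
print(span["scores"])  # [0.91, 0.42] -- scores are the first thing to check
```

With the scores recorded alongside the chunk text, a bad answer takes seconds to classify as a retrieval problem or a synthesis problem.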
Embedding generation
Both query embeddings and document embeddings should be traced with their model name, token count, and latency. This catches cost spikes and lets you compare embedding model versions when you upgrade.
Query engine LLM call
The main LLM call in the query engine needs full input and output tracing, including the system prompt, the retrieved context, and the final response. This is where hallucinations are caught. If the trace shows the context contained the right answer but the response did not, you have a synthesis prompt problem.
Response synthesizer
LlamaIndex response synthesizers can use different strategies like refine, compact, or tree-summarize. Each strategy makes a different number of LLM calls with different inputs. Tracing the synthesizer reveals which strategy is being used, how many intermediate calls it makes, and where time is being spent.
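To make the difference concrete, here is a back-of-the-envelope model of call counts, not LlamaIndex internals: refine makes roughly one LLM call per retrieved chunk, while compact first packs chunks into as few context windows as possible.

```python
def refine_calls(chunk_sizes):
    """Refine-style synthesis: roughly one LLM call per retrieved chunk."""
    return len(chunk_sizes)

def compact_calls(chunk_sizes, window_tokens):
    """Compact-style synthesis: greedily pack chunks into context windows,
    one LLM call per window. Assumes no single chunk exceeds the window."""
    calls, used = 1, 0
    for size in chunk_sizes:
        if used + size > window_tokens:
            calls += 1
            used = 0
        used += size
    return calls

chunks = [900, 800, 700, 600, 500]   # token counts of retrieved chunks
print(refine_calls(chunks))           # 5 calls
print(compact_calls(chunks, 4000))    # 1 call: everything fits in one window
```

A synthesizer trace shows you which side of this tradeoff you are actually on, instead of guessing from wall-clock latency.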
Sub-queries
When a query engine decomposes a question into sub-queries, each sub-query is its own mini-pipeline with its own retrieval and LLM call. Tracing sub-queries as child spans under the parent query makes it possible to see the full decomposition tree at a glance.
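A decomposition trace is just nested spans. A minimal pure-Python sketch of such a tree, with a hypothetical span shape:

```python
def render(span, depth=0):
    """Pretty-print a span tree, one indented line per span."""
    lines = ["  " * depth + f"{span['kind']}: {span['name']}"]
    for child in span.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines

trace = {
    "kind": "query", "name": "compare Q3 vs Q4 revenue",
    "children": [
        {"kind": "sub_query", "name": "Q3 revenue",
         "children": [{"kind": "retrieval", "name": "top_k=3"},
                      {"kind": "llm", "name": "answer Q3"}]},
        {"kind": "sub_query", "name": "Q4 revenue",
         "children": [{"kind": "retrieval", "name": "top_k=3"},
                      {"kind": "llm", "name": "answer Q4"}]},
        {"kind": "llm", "name": "synthesize comparison"},
    ],
}
print("\n".join(render(trace)))
```

Reading the decomposition at a glance like this is exactly what logs fail to give you: the sub-queries are interleaved with everything else and the parent-child relationship is lost.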
Tool calls
Modern LlamaIndex applications often use agents that call tools such as search, code execution, or external APIs. Each tool call is a potential failure point and needs to be traced with its inputs, outputs, and duration. LlamaIndex tracing without tool call visibility is incomplete for any agentic pipeline.
Observability Options for LlamaIndex
Several tools offer LlamaIndex observability, each with different tradeoffs. Here is an honest comparison.
Built-in LlamaIndex Callbacks
LlamaIndex ships with a callback system that lets you hook into events at each pipeline stage. This is powerful and free, but you have to build the visualization, storage, and querying layer yourself. For teams with existing internal observability infrastructure, callbacks are a reasonable starting point. For teams that just want to see their traces, callbacks alone are not enough. You will end up reinventing a tracing UI.
OpenTelemetry plus Phoenix
Arize Phoenix is an open source tracing UI that speaks OpenTelemetry. It has good LlamaIndex integration and you can self-host it. The tradeoff is operational complexity. You need to run Phoenix, configure OpenTelemetry exporters, manage storage, and handle upgrades. For teams that need full data residency control, this is worth it. For teams that want to ship fast, the operational overhead is a real cost.
Glassbrain SDK Wrapping
Glassbrain takes a different approach. Instead of asking you to instrument LlamaIndex directly, it wraps the OpenAI or Anthropic SDK that LlamaIndex uses underneath. One line of code. No callback registration, no OpenTelemetry configuration, no self-hosting. Because LlamaIndex calls the standard OpenAI and Anthropic clients for its LLM and embedding operations, wrapping those clients captures every LlamaIndex operation automatically. You get 1,000 traces per month on the free tier with no credit card required, JS and Python SDKs, replay without sharing user keys, AI fix suggestions on failed traces, and a visual trace tree. The tradeoff is that you are using a hosted service, so for teams with strict self-hosting requirements, this will not fit.
LangSmith
LangSmith is LangChain's tracing product and also supports LlamaIndex through OpenTelemetry compatibility. If you are already on the LangChain ecosystem, it is a natural choice. The tradeoff is that the LlamaIndex experience feels like a second-class citizen compared to LangChain, and pricing can ramp up quickly for high-volume pipelines.
Setting Up Glassbrain with LlamaIndex
Setting up Glassbrain for LlamaIndex observability takes about two minutes. Because LlamaIndex uses the OpenAI or Anthropic SDKs under the hood for embeddings and LLM calls, wrapping those clients is all you need. Here is the step-by-step setup.
First, install the SDK. If you are on Python, which is the most common LlamaIndex language, run pip install glassbrain. If you are using JS, npm install glassbrain. The install is small, has no exotic dependencies, and does not require any build step changes.
Second, sign up for a free Glassbrain account. You get 1,000 traces per month with no credit card. Grab your API key from the dashboard. You will set this as an environment variable.
Third, wrap the underlying client. If your LlamaIndex app uses OpenAI, import wrap_openai from glassbrain and wrap the OpenAI client before passing it to LlamaIndex. If your app uses Anthropic, use wrap_anthropic. In practice, LlamaIndex lets you pass a custom llm and embed_model, both of which accept an already-configured client. Pass the wrapped client, and every LLM call and embedding call that LlamaIndex makes through that client will be traced automatically.
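The mechanics of SDK-level wrapping are easy to show in plain Python: intercept the client method, record a span, and delegate. This is a conceptual sketch of the technique, not Glassbrain's actual implementation, and FakeLLMClient is a stand-in for a real OpenAI or Anthropic client:

```python
import functools
import time

SPANS = []  # a real SDK would ship these to a backend in the background

def wrap_method(obj, method_name):
    """Replace obj.<method_name> with a version that records inputs,
    outputs, errors, and duration before delegating to the original."""
    original = getattr(obj, method_name)

    @functools.wraps(original)
    def traced(*args, **kwargs):
        span = {"method": method_name, "kwargs": kwargs, "start": time.time()}
        try:
            span["output"] = original(*args, **kwargs)
            return span["output"]
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            span["duration_s"] = time.time() - span["start"]
            SPANS.append(span)

    setattr(obj, method_name, traced)
    return obj

class FakeLLMClient:
    """Stand-in for an OpenAI/Anthropic client for the sake of the sketch."""
    def complete(self, prompt):
        return f"answer to: {prompt}"

client = wrap_method(FakeLLMClient(), "complete")
client.complete(prompt="What is RAG?")
print(SPANS[0]["method"])  # the call was recorded transparently
```

Because the caller's code path is unchanged, anything built on top of the client, LlamaIndex included, gets traced without knowing the wrapper exists. That is the whole idea behind wrapping at the SDK level.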
Fourth, run your LlamaIndex query as normal. The wrapped client intercepts each call, records the inputs, outputs, token counts, latency, and any errors, and ships the trace to Glassbrain in the background. Your LlamaIndex RAG pipeline does not need any other changes. No callbacks to register, no context managers to wrap queries, no spans to create manually.
Fifth, open your Glassbrain dashboard. Within a few seconds of running a query, the trace appears in the visual trace tree. You see the parent query, each retrieval call, each embedding, each LLM call, and the full input and output of every span. Click any node to see the full context. If the call failed, Glassbrain shows an AI fix suggestion that often points directly at the misconfigured prompt or missing parameter.
That is the entire setup. No YAML, no OpenTelemetry collector, no self-hosted UI. Because Glassbrain works at the SDK level rather than the framework level, it works with any LlamaIndex version, including future versions that have not shipped yet. As long as LlamaIndex is calling OpenAI or Anthropic underneath, Glassbrain traces it. This is a deliberate design choice. Framework-level integrations break every time the framework releases a new version. SDK-level wrapping does not.
What You See in the Trace
Once traces start flowing, the visual trace tree is where LlamaIndex debugging actually happens. The root of the tree is the top-level query. Under it, you see a retrieval span showing which documents were fetched from the vector store, with their similarity scores and chunk text previews. Next to the retrieval, you see an embedding span for the query itself, showing the embedding model, token count, and latency.
Below that, you see the LLM call span. This is where the query engine asks the model to answer the question based on the retrieved context. The span shows the full system prompt, the full user message including the retrieved chunks, the model response, the token counts, and the cost. If response synthesis uses a multi-step strategy like refine, you see a synthesizer span with multiple child LLM calls, each representing one refinement step.
For query engines that use sub-query decomposition, the tree gets deeper. The top-level query has child sub-query spans, each of which has its own retrieval, embedding, and LLM call children. You can see at a glance how the question was broken down, which sub-queries took the longest, and which ones returned empty results.
Click any node in the tree to see the full context on the right side panel. Full prompts, full responses, full retrieved documents, full tool inputs and outputs. Nothing is truncated. The tutorial in the dashboard walks new users through reading the trace tree the first time. If a call failed, the node turns red and the AI fix suggestion appears in the side panel, often naming the exact issue and a one-line fix.
Debugging Common LlamaIndex Issues
Here are the LlamaIndex debugging patterns we see most often and how traces help.
Retrieval returning irrelevant docs
Symptom: the model gives wrong or vague answers for questions that should be answerable from your data. Open the trace, click the retrieval span, and read the returned chunks. If the top-k chunks are obviously unrelated, your embedding model or chunking strategy is the problem, not your prompt. Traces make this diagnosis a ten second check.
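That ten-second check can even be automated over the retrieval span data. A toy heuristic follows; the 0.5 cutoff is arbitrary and should be calibrated against your own embedding model:

```python
def diagnose_retrieval(scored_chunks, low=0.5):
    """Flag retrievals whose best similarity score is below a threshold.

    scored_chunks: list of (chunk_text, score) pairs as recorded in a
    retrieval span. The default cutoff is arbitrary; calibrate per model."""
    if not scored_chunks:
        return "empty retrieval: index may be missing the relevant documents"
    best = max(score for _, score in scored_chunks)
    if best < low:
        return f"top score {best:.2f} is low: question likely out of distribution"
    return "retrieval looks healthy"

print(diagnose_retrieval([("unrelated chunk", 0.31), ("also off-topic", 0.28)]))
print(diagnose_retrieval([("refund policy text", 0.88)]))
```

Run against a day of traces, a check like this surfaces the queries your index cannot answer before users complain about them.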
Embedding cost spikes
Symptom: monthly bill is higher than expected. Filter Glassbrain traces by embedding span, sort by token count, and you immediately see which queries or ingestion runs are the culprits. Often the cause is an accidental re-embedding of an entire corpus, or a pipeline that embeds the same query multiple times.
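Finding the offenders is a sort over token counts. A sketch, with a made-up price per token; check your provider's pricing sheet for real numbers:

```python
PRICE_PER_1K_TOKENS = 0.0001  # made-up embedding price for illustration

def top_offenders(embedding_spans, n=3):
    """Rank embedding spans by token count and attach an estimated cost."""
    ranked = sorted(embedding_spans, key=lambda s: s["tokens"], reverse=True)
    return [
        {**span, "est_cost": span["tokens"] / 1000 * PRICE_PER_1K_TOKENS}
        for span in ranked[:n]
    ]

spans = [
    {"name": "re-embed full corpus", "tokens": 2_000_000},
    {"name": "user query", "tokens": 12},
    {"name": "nightly ingestion", "tokens": 150_000},
]
for s in top_offenders(spans):
    print(s["name"], s["tokens"], round(s["est_cost"], 4))
```

The accidental full-corpus re-embed jumps straight to the top of a ranking like this, which is usually the whole diagnosis.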
Slow response synthesis
Symptom: p99 latency is bad. Open a slow trace and look at the synthesizer span. If it has many child LLM calls, you are using a refine or tree-summarize strategy on a large context and paying for it with latency. Switching to compact or reducing top-k often fixes this, and the trace makes the decision data-driven.
Query decomposition bugs
Symptom: the model answers a different question than the user asked. Open the trace, look at the sub-query spans, and read the decomposed queries. If the decomposition mangled the user intent, the fix is in the query decomposition prompt, not in retrieval or synthesis. This is one of the hardest bugs to catch without tracing, because logs alone rarely show the intermediate sub-queries clearly.
Frequently Asked Questions
Does Glassbrain work with LlamaIndex out of the box?
Yes. Because LlamaIndex uses the OpenAI or Anthropic SDK underneath for LLM and embedding calls, wrapping those SDKs with wrap_openai or wrap_anthropic automatically traces every LlamaIndex operation. No LlamaIndex-specific configuration is required.
What about async LlamaIndex queries?
Async is fully supported. The wrapped client handles both sync and async methods on the OpenAI and Anthropic SDKs, and LlamaIndex async queries produce the same trace tree as sync queries.
Does it trace vector database calls?
The LLM and embedding calls are traced at the SDK level. Vector database calls happen inside LlamaIndex retriever code rather than through OpenAI or Anthropic, so they show up as part of the retrieval span with the documents returned, but the raw vector DB call itself is not a separate span unless you add that instrumentation yourself.
Can I see retrieval scores in the trace?
Yes. Similarity scores returned by the retriever appear in the retrieval span payload along with the document chunks. This makes it easy to spot cases where the top result has a low score, which usually means the question is out of distribution for your index.
What about streaming responses?
Streaming is supported. The wrapped client captures the full response once the stream completes and records it as a single span, so streaming LlamaIndex queries produce the same traces as non-streaming queries.
Is there a LlamaIndex-specific integration?
There does not need to be. SDK-level wrapping catches everything LlamaIndex does through OpenAI and Anthropic. This is more robust than a framework-specific integration because it does not break when LlamaIndex releases new versions. If you are using a different LLM provider that LlamaIndex supports, reach out and we can add SDK wrapping for that provider.
Related Reading
- Debugging LLM Agents: A Practical Guide
- How to Add LLM Tracing Without Rewriting Your Code
- The Complete LLM Observability Guide
Observability for LlamaIndex in one line.
Try Glassbrain Free