How to Detect and Fix AI Hallucinations in Production
Learn practical AI hallucination detection methods including source grounding, consistency checks, and tracing. Includes code examples and a real case study.
Your RAG chatbot just told a customer they have 90 days to return a product. Your actual return policy is 14 days. Nobody noticed for three weeks. Sound familiar? If you are building production AI applications, AI hallucination detection methods are not optional. They are as critical as error handling or input validation. The difference is that hallucinations fail silently. There is no stack trace, no 500 error, no alert. Just wrong answers delivered with full confidence.
This guide walks through the real causes of hallucinations in production, how to detect them programmatically, and how to fix them with concrete code examples. No hand-waving about "the model sometimes makes things up." We are going deeper than that.
Why Production AI Apps Hallucinate: The 4 Root Causes
Most discussions of the AI hallucination problem stop at "LLMs generate plausible-sounding text that isn't grounded in reality." That is true but useless for debugging. In production RAG applications, hallucinations almost always trace back to one of four specific failure modes.
1. Wrong Document Retrieval
This is the most common cause and the hardest to catch. Your vector search returns documents that are semantically similar but contextually wrong. The embedding for "partner return policy" and "customer return policy" will be nearly identical. The model dutifully summarizes whatever you feed it, so garbage in, garbage out.
2. Context Window Overflow
You stuff 15 documents into the context window. The model attends to the first few and the last few, largely ignoring what is in the middle. This is the well-documented "lost in the middle" problem. The answer the user needs is in document #8, but the model synthesizes from documents #1 and #14 instead.
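One widely used mitigation, sketched below in Python (the helper name is mine, not any library's API), is to reorder retrieved documents so the strongest matches sit at the beginning and end of the context, pushing weaker matches into the middle where attention is weakest:

```python
def reorder_for_context(docs):
    """Place the most relevant documents at the start and end of the
    context, pushing the least relevant toward the middle.

    Assumes `docs` is already sorted by relevance, best first.
    """
    front, back = [], []
    for i, doc in enumerate(docs):
        # Alternate: best doc goes first, second-best goes last,
        # and so on, working inward from both edges
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]
```

For five documents ranked 1 through 5 (best first), this yields the order [1, 3, 5, 4, 2]: the two strongest matches end up at the edges of the context, where the model is most likely to attend to them.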
3. Prompt Conflicts
Your system prompt says "always provide specific numbers." Your retrieved document contains a range ("between 14 and 90 days depending on customer type"). The model resolves the conflict by picking the most confident-sounding answer, which may be the wrong one for the user's context.
4. Model Limitations and Training Bleed
The model's pre-training data contains outdated information about your domain. When retrieval results are ambiguous, the model falls back on parametric knowledge. Your pricing changed six months ago but the model still "remembers" the old numbers.
AI Hallucination Detection Methods That Actually Work
Detecting hallucinations in production requires more than spot-checking. You need automated, continuous verification. Here are three methods that work at scale.
Source Grounding Verification
The most reliable detection method is checking whether every claim in the output can be traced back to a source document. This is not as simple as string matching. You need semantic comparison with a strict threshold.
```javascript
// Assumes extractClaims, getEmbedding, and cosineSimilarity are
// defined elsewhere in your application.
async function checkGrounding(output, sourceDocuments, threshold = 0.75) {
  // Split the output into individual claims
  const claims = await extractClaims(output);
  const results = [];

  for (const claim of claims) {
    const claimEmbedding = await getEmbedding(claim.text);
    let bestMatch = { score: 0, source: null };

    for (const doc of sourceDocuments) {
      // Compare against each sentence in the source
      const sentences = doc.content.split(/[.!?]+/).filter(Boolean);
      for (const sentence of sentences) {
        const sentenceEmbedding = await getEmbedding(sentence.trim());
        const similarity = cosineSimilarity(claimEmbedding, sentenceEmbedding);
        if (similarity > bestMatch.score) {
          bestMatch = {
            score: similarity,
            source: doc.id,
            sentence: sentence.trim(),
          };
        }
      }
    }

    results.push({
      claim: claim.text,
      grounded: bestMatch.score >= threshold,
      confidence: bestMatch.score,
      matchedSource: bestMatch.source,
      matchedSentence: bestMatch.sentence,
    });
  }

  const ungroundedClaims = results.filter((r) => !r.grounded);
  return {
    isGrounded: ungroundedClaims.length === 0,
    totalClaims: results.length,
    ungroundedClaims,
    groundingScore:
      results.reduce((sum, r) => sum + r.confidence, 0) / results.length,
  };
}
```
Self-Consistency Checks
Run the same query multiple times with slightly varied prompts. If the model gives contradictory answers, at least one of them is hallucinated. This is expensive but effective for high-stakes outputs.
```python
import asyncio
from collections import Counter

# Assumes extract_key_facts and group_by_topic are defined elsewhere
# in your application.

async def consistency_check(query, context, model, num_samples=3):
    """Run the same query multiple times and check for contradictions."""
    prompts = [
        f"Answer this question using only the provided context:\n{context}\n\nQuestion: {query}",
        f"Based on the following information, answer the question.\nInfo: {context}\n\nQ: {query}",
        f"Context: {context}\n\nUsing only the above context, respond to: {query}",
    ]

    responses = await asyncio.gather(*[
        model.generate(prompt, temperature=0.1)
        for prompt in prompts[:num_samples]
    ])

    # Extract key facts from each response
    facts_per_response = []
    for resp in responses:
        facts = await extract_key_facts(resp)
        facts_per_response.append(facts)

    # Find contradictions
    contradictions = []
    all_facts = [f for facts in facts_per_response for f in facts]
    fact_groups = group_by_topic(all_facts)
    for topic, facts in fact_groups.items():
        unique_values = set(f["value"] for f in facts)
        if len(unique_values) > 1:
            contradictions.append({
                "topic": topic,
                "values": list(unique_values),
                "frequency": Counter(f["value"] for f in facts).most_common(),
            })

    return {
        "consistent": len(contradictions) == 0,
        "contradictions": contradictions,
        "responses": responses,
    }
```
Confidence Scoring with Logprobs
If your model exposes token-level log probabilities, low-confidence tokens in factual claims are a strong hallucination signal. A model that outputs "the return period is 90 days" with low probability on the token "90" is likely guessing.
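Here is a minimal sketch of that idea, assuming your provider returns the output as (token, logprob) pairs; the function name and threshold are illustrative, not any specific API. It flags only numeric tokens, since those carry the factual weight in answers like return windows and prices:

```python
import math
import re

def flag_low_confidence_numbers(tokens, logprob_threshold=-1.0):
    """Flag numeric tokens generated with low probability.

    `tokens` is a list of (token_text, logprob) pairs, as exposed by
    APIs that return token-level log probabilities.
    """
    flagged = []
    for text, logprob in tokens:
        # Only numeric tokens are checked: digits in factual claims
        # (days, prices, quantities) are where guessing shows up
        if re.fullmatch(r"\s*\d+(\.\d+)?\s*", text):
            if logprob < logprob_threshold:
                flagged.append({
                    "token": text.strip(),
                    "probability": round(math.exp(logprob), 3),
                })
    return flagged
```

A logprob of -2.3 on the token "90" corresponds to roughly a 10% probability, which is a strong signal to route that response for grounding verification or human review.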
Case Study: The Return Policy Chatbot Bug
Here is a real scenario that illustrates why detection alone is not enough. You need tracing to find the root cause.
A B2B SaaS company deployed a support chatbot using RAG. Customers started complaining that the bot was telling them they had 90 days to request a refund. The actual customer refund window was 14 days. The 90-day window was a partner/reseller policy stored in a completely different section of their knowledge base.
The grounding check would not catch this. The output was grounded in a real document. It was just the wrong document.
Finding the Root Cause with Tracing
When you trace the full request lifecycle, the bug becomes obvious. Here is what the trace revealed:
- User asked: "What is your refund policy?"
- The query embedding was generated. So far so good.
- Vector search returned 5 documents. Document #1 (highest similarity score: 0.94) was the partner refund policy. Document #3 (similarity: 0.91) was the customer refund policy.
- The model used Document #1 because it ranked highest.
The embeddings for "partner refund policy" and "customer refund policy" were nearly identical because the surrounding text was almost the same. The only difference was the audience and the number of days.
This is exactly the kind of bug that tools like Glassbrain are built to surface. When you can visually inspect each step of the pipeline, from the query to the retrieved documents to the final output, you can see that the retrieval step is where things went wrong, not the generation step.
The Fix: Metadata Filtering for Correct Document Retrieval
The AI hallucination fix here is not about the model. It is about retrieval. You need metadata filters to ensure the right documents are retrieved for the right context.
Before: Unfiltered Vector Search
```javascript
// BEFORE: naive retrieval - no audience filtering
async function getRelevantDocs(query) {
  const embedding = await getEmbedding(query);
  const { data: documents } = await supabase.rpc("match_documents", {
    query_embedding: embedding,
    match_count: 5,
  });
  return documents;
}
```
After: Metadata-Filtered Retrieval
```javascript
// AFTER: filtered retrieval with audience and document type metadata
async function getRelevantDocs(query, userContext) {
  const embedding = await getEmbedding(query);

  // Determine the audience type from user context
  const audienceType = userContext.isPartner ? "partner" : "customer";

  const { data: documents } = await supabase.rpc("match_documents_filtered", {
    query_embedding: embedding,
    match_count: 5,
    filter_audience: audienceType,
    filter_status: "published",
  });

  // Secondary check: verify document relevance
  const verified = documents.filter((doc) => {
    if (doc.metadata?.audience && doc.metadata.audience !== audienceType) {
      console.warn(
        `Filtered out mismatched doc: ${doc.id} ` +
        `(audience: ${doc.metadata.audience}, user: ${audienceType})`
      );
      return false;
    }
    return true;
  });

  return verified;
}
```
Verifying the Fix Without Redeploying
You have written the fix. But how do you know it works for the specific queries that were failing? You could deploy to staging and manually test. Or you could replay the exact production requests that triggered the hallucination.
This is where replay tooling becomes invaluable. If you have been tracing your requests, you already have the exact inputs that caused the problem. Glassbrain's replay feature lets you take a captured trace, apply your code changes locally, and re-run the request to see if the output changes. No deployment needed. You can verify the fix against dozens of failing queries in minutes.
The workflow looks like this:
- Identify failing traces in your dashboard.
- Apply the metadata filter fix locally.
- Replay each failing trace with the new code.
- Confirm that the correct documents are now retrieved and the output is accurate.
- Deploy with confidence.
Prevention Strategies: Stop Hallucinations Before They Start
Better Chunking with Overlap
Small, well-bounded chunks with rich metadata consistently outperform large, undifferentiated chunks for retrieval precision. Include enough overlap that no fact is split across chunk boundaries.
```python
def chunk_with_metadata(document, chunk_size=512, overlap=64):
    """Chunk a document while preserving section metadata."""
    chunks = []
    text = document["content"]
    current_section = document.get("default_section", "general")

    for i in range(0, len(text), chunk_size - overlap):
        chunk_text = text[i:i + chunk_size]
        # detect_section is an application-specific helper that infers
        # which section heading a chunk belongs to
        section = detect_section(chunk_text) or current_section
        if section != current_section:
            current_section = section
        chunks.append({
            "content": chunk_text,
            "metadata": {
                "source_doc": document["id"],
                "section": current_section,
                "audience": document.get("audience", "general"),
                "last_updated": document.get("updated_at"),
                "chunk_index": len(chunks),
            },
        })
    return chunks
```
Prompt Guardrails
Tell the model explicitly what to do when it is uncertain:
```javascript
const systemPrompt = `You are a support assistant. Answer questions using ONLY
the provided context documents.

Rules:
- If the context does not contain enough information to answer fully, say
  "I don't have enough information to answer that accurately" and suggest
  the user contact support.
- Never combine information from documents marked with different audience
  types.
- When citing numbers (prices, dates, durations), always include which
  document the number came from.
- If two documents contain conflicting information, mention both and ask
  the user to clarify their situation.`;
```
Continuous Monitoring
Run your grounding checks asynchronously on a sample of production responses. Track your grounding score over time. A sudden drop means something changed in your document store or retrieval pipeline. Set up alerts for when the score drops below your threshold. Check your monitoring options to find a setup that fits your scale.
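A minimal sketch of that monitoring loop, with every name here illustrative rather than any specific library's API: sample a fraction of responses, keep a rolling window of grounding scores, and fire an alert callback when the average dips below your threshold.

```python
import random
from collections import deque

class GroundingMonitor:
    """Track a rolling grounding score over sampled production
    responses and fire an alert callback when it drops too low."""

    def __init__(self, sample_rate=0.1, window=200, alert_below=0.8,
                 on_alert=None):
        self.sample_rate = sample_rate
        self.scores = deque(maxlen=window)  # rolling window of scores
        self.alert_below = alert_below
        self.on_alert = on_alert or (lambda avg: None)

    def maybe_record(self, grounding_score):
        # Sample a fraction of traffic to keep checking cheap
        if random.random() < self.sample_rate:
            self.record(grounding_score)

    def record(self, grounding_score):
        self.scores.append(grounding_score)
        avg = sum(self.scores) / len(self.scores)
        if avg < self.alert_below:
            # Wire this to your pager or alerting channel
            self.on_alert(avg)
```

In practice you would call `maybe_record` with the `groundingScore` produced by your grounding check, from a background task so it never blocks the user-facing response.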
Putting It All Together
Hallucination detection and prevention in production is not a single technique. It is a pipeline:
- Chunk documents properly with rich metadata including audience, section, and freshness markers.
- Filter at retrieval time using metadata, not just vector similarity.
- Verify grounding by comparing output claims against source documents.
- Run consistency checks on high-stakes outputs.
- Trace every request so you can diagnose issues when they happen.
- Replay failing requests to verify fixes before deploying.
- Monitor continuously and alert on grounding score drops.
The hardest part of the AI hallucination problem is not that models make things up. It is that the failure is silent and the root cause is often far from the symptom. A wrong answer in the output usually traces back to a retrieval or chunking issue, not a model issue. The only way to find these bugs quickly is to have full visibility into your pipeline.
Start debugging your AI apps visually.
Try Glassbrain Free