LLM Evaluation: How to Test AI Apps That Are Not Deterministic
A practical guide to LLM evaluation for developers: what to test, how to test it, and how to catch regressions in non-deterministic AI apps.
LLM Evaluation: A Practical Guide for Developers
LLM evaluation is the process of measuring whether a large language model, prompt, or agent actually does the job you built it for. It sounds simple until you try it. Unlike traditional software, where a function either returns the right value or it does not, language models produce outputs that are probabilistic, verbose, and context dependent. The same prompt can return ten different valid answers, and all of them might be correct in spirit while failing a naive string comparison. This is why most developers who try to write unit tests for their LLM features end up with a brittle suite that breaks every time the model is updated or the temperature is nudged.
Evaluating LLMs is closer to statistical testing than it is to assert equals. You care about how often the system gets close enough, how reliably it stays inside a schema, how well it handles edge cases, and how badly it fails when it fails. A good llm evaluation workflow gives you a score you can track across prompt versions, model versions, and deployments, so you can answer the only question that matters during a refactor: did this change make the product better or worse?
This guide walks through what to evaluate, which llm testing techniques actually work in practice, how to build a pipeline around them, and how tools like Glassbrain fit into the picture as the trace data layer that feeds your evals.
What You Are Actually Evaluating
Before you pick a metric, you need to be honest about what you are grading. Most teams that struggle with llm evals skipped this step and jumped straight into cosine similarity or BLEU scores, then wondered why their numbers had nothing to do with user satisfaction. There are four dimensions worth measuring, and every serious ai evaluation setup touches all of them.
Correctness is whether the output answers the question, solves the task, or retrieves the right facts. For a RAG system this means the answer is grounded in the retrieved documents. For a code generator it means the code compiles and passes tests. For a classifier it means the label is right. Correctness is the most important axis and also the hardest to automate, because right answers come in many shapes.
Format is whether the output is machine readable in the way your downstream code expects. If you ask for JSON and get JSON wrapped in a markdown fence with an apology paragraph above it, your parser will crash in production. Format checks are cheap, deterministic, and catch a surprising share of real bugs.
Tone and style are whether the voice matches the brand, the reading level fits the audience, and the length is appropriate. This is squishier but still measurable, especially with an llm-as-judge rubric.
Safety covers refusals, hallucinations, leaked system prompts, toxic content, and PII exposure. Even if you do not build a chatbot for the public, any user facing LLM can be coaxed into embarrassing output, and you want to catch it before a customer does.
Teams skip this taxonomy because it feels like process work. The cost of skipping it is an eval suite that reports a healthy 0.87 score while your product is visibly broken.
The Main LLM Evaluation Techniques
Exact Match and Substring Match
The simplest llm evaluation technique is to compare the model output to a reference string. Exact match works when there is exactly one right answer, like a math problem, a SQL query against a known schema, or a classification label. Substring match relaxes this by checking whether the expected answer appears anywhere in the response, which is useful when the model likes to add filler like "The answer is" before the actual token you care about. Both are blazingly fast, free, and fully deterministic, which makes them ideal for the first line of defense in a CI pipeline. The obvious weakness is that they punish paraphrasing. If the golden answer is "Paris" and the model says "The capital of France is Paris", substring match passes, but if it says "It is Paris, of course", exact match fails. Use these checks for structured tasks and short answers, and pair them with a more forgiving metric for free form generation. A good pattern is to normalize both strings first, lowercasing, stripping punctuation, and trimming whitespace, which removes a whole class of false negatives without hiding real bugs.
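As a sketch, the normalize-then-match pattern above might look like this in Python (the function names are illustrative, not from any particular framework):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def exact_match(output: str, expected: str) -> bool:
    """Pass only when the whole normalized output equals the golden answer."""
    return normalize(output) == normalize(expected)

def substring_match(output: str, expected: str) -> bool:
    """Pass when the golden answer appears anywhere in the normalized output."""
    return normalize(expected) in normalize(output)
```

With this in place, `substring_match("The capital of France is Paris.", "Paris")` passes while `exact_match("It is Paris, of course", "Paris")` fails, exactly the behavior described above.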
Schema Validation
If your prompt asks the model to return structured output, schema validation is the highest value eval you can add, and it costs almost nothing. Define a JSON Schema, a Pydantic model, or a Zod object that describes the shape you expect, then try to parse the model response against it. A pass means the object has the required fields, the types match, the enums are respected, and any nested objects are well formed. This catches a huge category of production incidents, like the model returning a string where you expected a number, dropping a required field, or wrapping the JSON in prose. Schema validation also gives you a clean binary signal that is easy to track over time, so you can see whether a new model version regressed your structured output reliability. Run it on every single trace in your eval dataset and on a sample of production traffic. If your pass rate ever dips below 99 percent for a critical path, something has broken upstream, usually a prompt edit or a temperature change.
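A minimal stdlib-only sketch of this check, assuming a hypothetical classification response shape (in practice you would likely use Pydantic, Zod, or JSON Schema as mentioned above):

```python
import json

# Hypothetical expected shape: a classification result with a label and a score.
# Swap in whatever fields your prompt actually asks for.
REQUIRED = {"label": str, "confidence": float}

def validate_response(raw: str) -> bool:
    """True only if raw is bare, parseable JSON with the required fields and types.

    Markdown fences, apology prose, missing fields, and wrong types all fail.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(
        field in obj and isinstance(obj[field], expected_type)
        for field, expected_type in REQUIRED.items()
    )
```

Because the result is a clean binary, the pass rate over a dataset is just the mean of this function across traces, which makes it trivial to track over time.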
Embedding Similarity
Embedding similarity compares the meaning of two pieces of text by turning each into a vector and measuring the cosine distance. This is how you evaluate free form generation when exact match is too strict. You embed the golden answer, embed the model output, and check whether they are close enough in vector space. It handles paraphrasing gracefully, since "The capital of France is Paris" and "Paris" land near each other. The catch is that embeddings encode topic more than correctness, so a wrong answer about Paris can score higher than a right answer about Lyon. Treat embedding similarity as a fuzzy filter, not a grader. Use it to flag outputs that are far from the reference for human review, and combine it with other metrics before making a pass or fail decision. Pick an embedding model that is cheap and stable, since you will run it millions of times across your llm evaluation dataset and you do not want your scores drifting every time a provider updates their model.
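The similarity computation itself is small enough to sketch with the stdlib; the vectors would come from whichever embedding model you picked, and the threshold below is an illustrative assumption you should calibrate against labeled data:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def needs_review(output_vec: list[float], golden_vec: list[float],
                 threshold: float = 0.85) -> bool:
    """Flag outputs far from the reference for human review (fuzzy filter,
    not a grader, per the caveat above). 0.85 is an assumed starting point."""
    return cosine_similarity(output_vec, golden_vec) < threshold
```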
LLM-as-Judge
LLM-as-judge uses a strong model, usually a frontier one, to grade the output of the model under test. You write a rubric, something like "Score this answer from 1 to 5 on factual accuracy, citing the passages from the reference", and the judge returns a score plus an explanation. This is the most flexible llm evaluation technique because you can grade anything you can describe in words, including tone, helpfulness, and reasoning quality. It is also the most dangerous if you are sloppy. Judges have biases. They prefer longer answers, they prefer their own outputs, and they are inconsistent run to run. Mitigate this by using a different model family than the one you are testing, by forcing a structured output with a fixed scale, by running each judgment twice and averaging, and by calibrating the judge against a small human labeled set. When it is tuned well, llm-as-judge correlates closely with human ratings and scales to datasets that would be impossible to grade by hand. It is the backbone of modern llm evals at most serious teams.
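One way to enforce the fixed scale and run-twice mitigations is to demand structured output from the judge and reject anything that does not parse. The rubric string and function names below are hypothetical:

```python
import json

# Hypothetical rubric; the judge is instructed to answer in strict JSON.
RUBRIC = (
    "Score this answer from 1 to 5 on factual accuracy against the reference. "
    'Respond only with JSON: {"score": <1-5>, "reason": "<one sentence>"}'
)

def parse_judgment(raw: str):
    """Return the judge's 1-5 score, or None if the reply is malformed."""
    try:
        score = json.loads(raw)["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return score if isinstance(score, int) and 1 <= score <= 5 else None

def averaged_score(first_run: str, second_run: str):
    """Run the judge twice and average, per the consistency mitigation above."""
    scores = [parse_judgment(first_run), parse_judgment(second_run)]
    if None in scores:
        return None  # treat an unparseable judgment as a failed eval, not a zero
    return sum(scores) / len(scores)
```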
Human Review
No automated metric fully replaces a human reading the output. Human review is slow and expensive, so you do not run it on every trace, but you should run it on a small weekly sample and on every flagged failure. Build a simple review interface where a teammate can see the input, the model output, the expected behavior, and a handful of buttons: pass, fail, edge case, hallucination. Store the labels. Over time this dataset becomes the ground truth you use to calibrate your automated evals, and it is the single most valuable asset in your evaluating llms workflow. Humans also catch issues no metric can see, like weird phrasings that undermine trust, subtle brand voice drift, or an answer that is technically correct but unhelpful. Keep the review set small enough that people will actually do it, around fifty to two hundred samples per cycle, and rotate the reviewers so you do not bake in one person's taste.
Regression Testing on Saved Traces
The highest leverage move in llm testing is to save every interesting production trace and replay it after any change. When a user hits a bug, capture the full trace, add it to a regression set, and from then on every prompt edit, model upgrade, or config change runs against that set before it ships. This gives you a direct answer to "did I break anything that used to work", which is exactly the question unit tests answer for normal code. The trick is having the trace data in the first place. You need the full input, the tool calls, the retrieved context, the model output, and enough metadata to rerun the call exactly. This is where an observability tool matters, because retrofitting trace capture after an incident is painful. Glassbrain captures traces with a one line SDK install, stores them with replay built in, and lets you pull the failing cases straight into your eval dataset without wiring up any keys.
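The replay loop itself is simple once the trace data exists. In this sketch, `traces` stands in for saved traces pulled from your observability store and `call_model` for whatever reruns the exact input; both are placeholders for your own plumbing:

```python
def run_regressions(traces, call_model, passes) -> dict:
    """Rerun every saved trace and report which cases regressed.

    traces: list of {"id": ..., "input": ..., "expected": ...} dicts
    call_model: fn(input) -> fresh model output
    passes: metric fn(output, expected) -> bool
    """
    failed = []
    for trace in traces:
        fresh_output = call_model(trace["input"])
        if not passes(fresh_output, trace["expected"]):
            failed.append(trace["id"])
    return {"total": len(traces), "failed": failed}
```

Gate the deploy on `failed` being empty, and every past incident becomes a permanent test.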
How to Build an LLM Evaluation Pipeline
A working llm evaluation pipeline is not a notebook you run once. It is a loop that runs continuously, improves over time, and gates your deployments. Here is the sequence that works in practice.
- Build a dataset. Start small. Twenty to fifty hand picked examples that cover the common happy paths and the three or four failure modes you have seen in development. Do not try to be comprehensive on day one. The goal is to have something you can run in under a minute so you will actually run it. Each row needs an input, optionally a reference answer, and a tag describing what the case is testing.
- Define your metrics. For each test case, decide which of the techniques above apply. Schema validation on structured outputs. Substring match on short factual answers. Embedding similarity plus llm-as-judge on free form generation. Write the metrics once as functions that take input and output and return a score. Keep them pure so they are easy to debug.
- Run on every prompt change. Wire the eval suite into your development loop so it runs locally before you commit and in CI before you merge. Fail the build if any critical metric drops below its threshold. This is the single habit that separates teams that ship reliable LLM features from teams that ship and pray.
- Track scores over time. Log every run with a timestamp, the prompt version, the model version, and the scores per metric. Plot the trend. You want to see the line go up when you improve things, and you want to catch the drop the moment a change regresses the product. Without a trend line, you are flying blind.
- Feed failures back into the dataset. Every time a user reports a bug or a judge flags a bad output, add that case to the dataset with the correct expected behavior. Your eval suite should grow with your product, and your regression coverage should compound over time. This is how you turn an incident into an asset instead of a fire drill.
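Steps two and three above can be sketched as a single gate function. The metric names and thresholds here are illustrative assumptions; the shape is what matters, pure metric functions plus a per-metric threshold that fails the build:

```python
def run_suite(dataset, metrics, thresholds) -> bool:
    """Return False (fail the build) if any metric's mean drops below its threshold.

    dataset: list of {"output": ..., "expected": ...} rows
    metrics: {name: fn(output, expected) -> score in [0, 1]}
    thresholds: {name: minimum acceptable mean score}
    """
    ok = True
    for name, metric in metrics.items():
        scores = [metric(row["output"], row["expected"]) for row in dataset]
        mean = sum(scores) / len(scores)
        threshold = thresholds.get(name, 1.0)  # default: require a perfect score
        status = "PASS" if mean >= threshold else "FAIL"
        print(f"{status} {name}: {mean:.2f} (threshold {threshold})")
        ok = ok and mean >= threshold
    return ok
```

Call this from a pytest test or a CI step and fail the pipeline when it returns `False`.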
Do this for a month and your llm evals will catch problems before users do, instead of after.
LLM Evaluation in Production
Offline evaluation on a fixed dataset is necessary but not sufficient. Production traffic has inputs you never imagined, users who phrase things in ways your dataset missed, and distribution shifts that only show up at scale. You need to evaluate what is actually happening, not just what you tested.
The pattern that works is sampling plus automated scoring plus replay. Sample a small percentage of live traces, say one to five percent, and run your eval metrics on them in the background. Schema validation and substring match are cheap enough to run on everything. LLM-as-judge is expensive, so reserve it for the sample. When a metric fails, flag the trace. When enough flags pile up in the same hour, page someone. This gives you a real time signal about the health of your LLM feature that is grounded in user facing quality, not in token counts or latency.
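A minimal sketch of that cheap-on-everything, expensive-on-a-sample split, with an assumed 3 percent sample rate inside the one-to-five-percent guideline above:

```python
import random

def score_live_trace(trace, cheap_checks, judge, sample_rate=0.03):
    """Run cheap checks on every trace; run the expensive judge on a sample.

    cheap_checks: {name: fn(trace) -> True on pass} (e.g. schema validation)
    judge: expensive fn(trace) -> True on pass (e.g. llm-as-judge)
    Returns the list of flag names raised on this trace.
    """
    flags = [name for name, check in cheap_checks.items() if not check(trace)]
    if random.random() < sample_rate and not judge(trace):
        flags.append("judge_failed")
    return flags
```

Pipe nonempty flag lists into whatever alerting you already use, with a rate threshold per hour before paging.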
When a failure comes in, you need to be able to replay it. Replay means rerunning the exact same input through the exact same prompt and tools so you can see what went wrong and test a fix. This is only possible if you captured the full trace at the time of the call, including any retrieved context and intermediate tool outputs. Glassbrain is built around this workflow. You install the JS or Python SDK in one line, it captures every LLM call as a visual trace tree, and replay is built in with no user keys to configure. You get one thousand traces per month free with no card, which is usually enough to cover a small product's production sampling plus a developer's local work. Glassbrain is observability first and evals are on the roadmap, so think of it as the trace data layer that powers your production evals rather than a full eval framework on its own. It also ships AI fix suggestions, which is useful when you want a starting point for debugging a failing trace instead of staring at a wall of JSON.
Common LLM Evaluation Mistakes
Most llm evaluation setups that fail share the same handful of mistakes. The first is a tiny dataset that never grows. Twenty examples is a fine place to start, but if you are still running twenty examples six months later, your evals are not protecting you. Every bug report should become a test case.
The second is picking the wrong metric. Teams reach for BLEU or ROUGE because they are in the literature, then discover these scores have almost no correlation with whether the product is useful. Pick metrics that match the failure modes you actually care about, and validate them against human judgment on a small set before trusting them.
The third is no production sampling. Offline evals on a curated dataset will always look healthier than reality because the dataset is full of cases you already solved. If you do not measure what is happening in production, you will be the last person to know when quality drops.
The fourth is no regression tracking. Running evals once before launch and never again is not evaluation, it is a screenshot. You need a trend line per metric per prompt version so you can see whether things are getting better or worse.
The fifth is judging on style instead of function. It is tempting to grade outputs on how polished they sound, but a pretty wrong answer is still wrong. Put correctness and format ahead of tone in your scoring, and weight them accordingly.
LLM Evaluation Tools to Consider
The llm evaluation tool space has moved fast over the last two years. Here is a quick map of what each major option is actually good at, so you can pick without reading six marketing pages.
| Tool | Best for |
|---|---|
| Braintrust | End to end eval platform with dataset management, experiments, and a hosted UI. Good fit for teams that want a managed workflow and do not mind a paid plan once they scale. |
| Confident AI and DeepEval | Open source Python framework with a large library of prebuilt metrics including faithfulness, answer relevancy, and hallucination detection. Pairs well with pytest. |
| LangSmith evals | Tight integration with LangChain apps, built in tracing, and a decent dataset and experiment UI. The obvious choice if you are already on LangChain. |
| Promptfoo | YAML driven llm testing tool that runs in CI and shines for prompt comparison across models and providers. Low ceremony, fast to adopt, great for prompt engineers. |
| Glassbrain | Trace capture and visual debugging with replay and AI fix suggestions. Positions as the trace data layer that feeds your evals, with one line SDK install, no user keys, and one thousand free traces per month. |
| Phoenix by Arize | Open source observability and evaluation for LLM apps, strong on RAG evaluation and drift monitoring. Good for teams that want to self host. |
These tools are not mutually exclusive. A common stack is Glassbrain or Phoenix for capturing traces and debugging failures, plus Promptfoo or DeepEval for the offline eval suite that runs in CI. Pick the combination that matches how your team already works instead of trying to consolidate everything into one vendor on day one.
Frequently Asked Questions
What is LLM evaluation?
LLM evaluation is the practice of measuring how well a language model, prompt, or agent performs on the tasks you built it for, using a combination of automated metrics, human review, and production sampling. It answers the question "is this version better than the last one" with numbers instead of vibes.
What is the difference between LLM evaluation and LLM monitoring?
Monitoring tracks operational signals like latency, token counts, error rates, and cost. Evaluation tracks quality signals like correctness, format compliance, and user helpfulness. You need both. Monitoring tells you the service is up, evaluation tells you the service is good.
How big should my LLM eval dataset be?
Start with twenty to fifty hand picked examples and grow from there. Most production eval suites end up in the range of two hundred to a few thousand cases. Quality and coverage of failure modes matter far more than raw size. A hundred well chosen cases beat ten thousand random ones.
What is LLM-as-a-judge?
LLM-as-a-judge is an evaluation technique where you use a strong model to grade the output of another model against a rubric. It is flexible enough to score open ended generation and correlates well with human ratings when the rubric is clear and the judge is calibrated. Use a different model family than the one you are testing to reduce self preference bias.
Can I evaluate LLMs without a golden dataset?
Yes. Reference free metrics like schema validation, llm-as-judge on a rubric, self consistency checks, and safety classifiers all work without golden answers. You can also use pairwise comparison, where a judge picks the better of two candidate responses, which avoids the need for ground truth entirely.
How often should I run LLM evals?
Run the fast subset of your suite on every commit and every prompt edit. Run the full suite, including llm-as-judge and any expensive metrics, before every deployment. Run production sampling continuously in the background. If you only remember one rule, it is that any change to a prompt or model should trigger evals before it ships.
Conclusion
LLM evaluation is the habit that separates teams who ship reliable AI features from teams who ship and debug in production. The mechanics are not complicated. Pick a handful of metrics that match your failure modes, build a small dataset you actually run, wire it into CI, track the scores over time, and feed production failures back into the loop. Add human review on a weekly sample to keep your automated metrics honest, and sample live traffic so you catch the problems your offline dataset never saw.
The part that is hard is not the metrics. It is having the trace data to run them on. Without full captures of your LLM calls, including inputs, tool outputs, retrieved context, and final responses, you cannot replay failures, you cannot build regression sets from real incidents, and you cannot score production behavior. Get that foundation in place first, then layer evals on top. Whether you use Glassbrain, Phoenix, LangSmith, or your own logging, the rule is the same: capture everything, make it replayable, and treat your eval suite as a living asset that grows every time something breaks. That is how llm testing stops being a chore and starts being the thing that keeps your product honest.
Related Reading
- Prompt Evaluation Metrics That Actually Matter in 2026
- LLM Monitoring in Production: A Complete Guide for 2026
- LLM Observability: A Practical Guide for Debugging AI Apps in Production
Capture trace data for your evals in one line.
Try Glassbrain Free