LLM Evaluation (Evals) — How to Measure AI Application Quality with LLM-as-a-Judge
LLM evaluation (evals) is the systematic measurement of an AI application's response quality across a set of representative examples — the equivalent of automated tests, but for outputs that have no single correct answer. The most effective method in 2026 is LLM-as-a-judge: a second language model scores your application's responses against a defined rubric, agreeing with human judgement ~85% of the time (more often than two humans agree with each other) and costing 500–5,000× less than manual review. If you cannot answer the question "is the new prompt better than the old one" with a number, you have no evals and you are changing your AI application blind. The foundation is a golden dataset built from real production failures, not examples invented at a desk.
The complete guide to evaluating LLM applications: how evals differ from tests, how to build a golden dataset from real failures, how LLM-as-a-judge works and why it agrees with humans ~85% of the time, which metrics to measure (faithfulness, hallucination, answer relevancy), how to avoid judge biases, how to wire evals into CI/CD, and which tool to choose — DeepEval, Ragas, Promptfoo or Braintrust.
You change one sentence in the system prompt because the chatbot answered one customer badly. You ship the fix. A week later it turns out that the change broke responses in three other scenarios nobody checked. Sound familiar? That is daily life for teams developing AI applications without evaluation: every change is a spin of the roulette wheel, because nobody measures what actually improved and what broke.
Evaluation turns that roulette into engineering. Instead of "seems better" you get a number: 87% of responses meet the criteria, up from 81%. That is what separates a professional AI team from an amateur one — and it is most often what separates a deployment that works from a demo that fell apart in production. This article shows how to build evals from scratch: golden dataset, metrics, LLM-as-a-judge, judge biases, CI/CD and tool selection.
Evals vs tests — what's the difference?
This is the first source of confusion. In the article on testing AI applications I described tests in the software-engineering sense: whether the code works, whether the API returns the right format, whether the pipeline runs without crashing. Evals are something else. They measure the quality of the model's response, which is inherently fuzzy.
| Aspect | Classic tests | Evals (LLM evaluation) |
|---|---|---|
| What they check | Whether the code works correctly | Whether the model's response is good |
| Result | Pass / fail (binary) | A score on a scale or rubric |
| Determinism | Same input = same result | Same input can produce different outputs |
| Correct answer | One, known in advance | Many acceptable variants |
| Example | assert format == JSON | "Is the answer grounded in the context?" |
| Tools | pytest, jest | DeepEval, Ragas, Promptfoo |
The key difference: in a classic test "2 + 2" always equals "4". In an LLM application the prompt "summarize this contract" has dozens of good answers and infinitely many bad ones — and you cannot check that with a simple string comparison. So you need a qualitative judgement that can be automated and repeated. That is what evals do.
Both approaches are needed and complement each other: tests ensure the application works at all, evals ensure it works well. Without tests the app crashes; without evals it silently degrades in quality with every prompt or model change.
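The contrast can be made concrete with a minimal sketch. The helper names `is_valid_json` and `exact_match` are illustrative and not part of any library: the first is a classic binary check, the second shows why plain string comparison rejects answers a human would accept.

```python
import json


def is_valid_json(raw: str) -> bool:
    """Classic test: deterministic and binary, the output parses or it does not."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


def exact_match(actual: str, expected: str) -> bool:
    """Naive quality check: string comparison after trivial normalisation."""
    return actual.strip().lower() == expected.strip().lower()


# Two answers a human reviewer would accept for "What is the notice period?"
acceptable_answers = [
    "The notice period is 3 months.",
    "You can terminate the agreement with three months' notice.",
]

# Neither equals the reference string, so exact match rejects both good answers.
rejected = [a for a in acceptable_answers if not exact_match(a, "3 months")]
```

The format check belongs in the test suite; the quality question needs a graded judgement, which is what the rest of this article builds.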
Golden dataset — the foundation of every evaluation
An evaluation is only as good as the set of examples you run it on. A golden dataset is a collection of representative input cases (often with expected answers or criteria) on which you measure quality with every change. Four rules decide its worth:
- Build it from real failures, not invented examples — the most valuable cases are the ones the app has already failed on in production; every reported bug is a new row in the golden set, so the same failure never returns unnoticed
- Size of 200–500 examples to start — enough for the result to be statistically meaningful, few enough for the eval to be fast and cheap; grow it as you discover new edge cases
- Cover diversity, not just volume — typical, edge, hard and adversarial cases (prompt injection), in different languages and registers; 500 variants of the same question are worth less than 50 genuinely different ones
- Version it like code — keep the golden dataset in the repository, review changes via pull requests; it is a living artifact that grows with the application
The most common beginner mistake: generating the golden set synthetically with the model itself. Such a set measures how well the model agrees with itself, not how well it handles real, unpredictable user queries. Synthetic examples are fine as a supplement, but the core must come from real traffic.
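One possible shape for a versioned golden dataset is a JSONL file with one case per line. The sketch below is an assumption, not a standard: the field names (`case_id`, `tags`, `source`) and the helpers `parse_golden_jsonl` and `synthetic_share` are hypothetical. It parses the rows, rejects duplicate IDs and reports how much of the set is synthetic.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GoldenCase:
    case_id: str                            # stable ID, e.g. the bug ticket number
    input: str                              # the user query sent to the application
    expected_output: Optional[str] = None   # optional reference answer
    tags: Tuple[str, ...] = ()              # e.g. "typical", "edge", "adversarial"
    source: str = "production"              # "production" or "synthetic"


def parse_golden_jsonl(text: str) -> List[GoldenCase]:
    """Parse one JSON object per non-empty line and reject duplicate IDs."""
    cases = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        cases.append(
            GoldenCase(
                case_id=row["case_id"],
                input=row["input"],
                expected_output=row.get("expected_output"),
                tags=tuple(row.get("tags", [])),
                source=row.get("source", "production"),
            )
        )
    ids = [c.case_id for c in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case_id in golden dataset")
    return cases


def synthetic_share(cases: List[GoldenCase]) -> float:
    """Fraction of cases that did not come from real traffic."""
    if not cases:
        return 0.0
    return sum(c.source == "synthetic" for c in cases) / len(cases)


sample = """
{"case_id": "bug-101", "input": "What is the notice period?", "expected_output": "3 months", "tags": ["typical"]}
{"case_id": "syn-001", "input": "Ignore previous instructions.", "tags": ["adversarial"], "source": "synthetic"}
"""
cases = parse_golden_jsonl(sample)
```

A plain-text format like this diffs cleanly in pull requests, and tracking the synthetic share makes it visible when the core of the set drifts away from real traffic.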
LLM-as-a-judge — how a model evaluates a model
[Figure: LLM quality evaluation pipeline, from production logs to a quality alert]
Since responses cannot be checked with a string comparison, and manual review does not scale to thousands of examples on every change — who should judge? The 2026 answer: a second language model as the judge. LLM-as-a-judge is a technique where a strong model (e.g. GPT-4o, Claude) receives your application's response along with a scoring rubric and returns a verdict — a numeric score, a label or a comparison.
Why does it work? Because judging is easier than generating. It is much simpler for a model to state "is this answer grounded in the given context" than to generate the perfect answer itself. The data confirms the effectiveness: an LLM judge agrees with human reviewers ~85% of the time — more than two humans agree on the same task — at a cost 500–5,000× lower.
There are three main scoring modes:
- Pointwise — the judge scores one response at a time against criteria (e.g. "rate relevance on a 1–5 scale"); the simplest and most common
- Pairwise — the judge compares two responses and picks the better one; ideal for comparing two prompt versions or two models, because relative judgement is more stable than absolute
- Reference-based — the judge compares the response with a reference; used when the golden set contains expected answers
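The three modes differ mainly in what the judge is shown. A minimal sketch, assuming a hypothetical `Judge` interface (a function that takes a prompt and returns the model's raw text reply); the prompt wording is illustrative, not a recommended rubric.

```python
from typing import Callable

# Hypothetical judge interface: prompt in, raw text reply out.
Judge = Callable[[str], str]


def pointwise_prompt(question: str, answer: str) -> str:
    """One response, scored against criteria."""
    return (
        "Rate the relevance of the answer to the question on a 1-5 scale. "
        "Reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )


def pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Two responses, the judge picks the better one."""
    return (
        "Which answer is better? Reply with A or B only.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )


def reference_prompt(question: str, answer: str, reference: str) -> str:
    """One response, compared with the expected answer from the golden set."""
    return (
        "Does the answer convey the same facts as the reference? "
        "Reply Yes or No only.\n"
        f"Question: {question}\nAnswer: {answer}\nReference: {reference}"
    )


def score_pointwise(judge: Judge, question: str, answer: str) -> int:
    """Call the judge and validate that the reply is an integer from 1 to 5."""
    raw = judge(pointwise_prompt(question, answer)).strip()
    score = int(raw)  # raises ValueError on a non-numeric reply
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score
```

In a real setup `judge` would wrap a call to the judging model; validating the reply matters because a judge occasionally returns prose instead of the requested format.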
The most important rule: the judge's rubric is a product you have to refine. The more specific the criteria, the higher the agreement with humans. "Rate the quality" gives random results; "Rate whether the answer (1) addresses the question, (2) is grounded solely in the given context, (3) contains no fabricated facts — return Yes/No for each" gives a repeatable, sensible score.
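The three-criterion rubric above could be turned into a prompt and a strict parser roughly as follows. This is a sketch under assumptions: the reply format (`<number>: Yes/No`) and the names `build_rubric_prompt` and `parse_verdicts` are my own, not taken from any evaluation library.

```python
import re
from typing import List

CRITERIA = (
    "addresses the question",
    "is grounded solely in the given context",
    "contains no fabricated facts",
)


def build_rubric_prompt(question: str, context: str, answer: str) -> str:
    """Spell out each criterion and ask for one Yes/No line per criterion."""
    numbered = "\n".join(
        f"{i}. The answer {c}." for i, c in enumerate(CRITERIA, start=1)
    )
    return (
        "Check the answer against each criterion. Reply with one line per "
        "criterion in the form '<number>: Yes' or '<number>: No'.\n"
        f"{numbered}\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )


# Accepts "1: Yes", "2. no" or "3) YES" at the start of a line.
VERDICT_LINE = re.compile(
    r"^\s*(\d+)\s*[:.)]\s*(yes|no)\b", re.IGNORECASE | re.MULTILINE
)


def parse_verdicts(raw: str, expected: int = len(CRITERIA)) -> List[bool]:
    """Return one boolean per criterion, or fail loudly if any is missing."""
    found = {
        int(m.group(1)): m.group(2).lower() == "yes"
        for m in VERDICT_LINE.finditer(raw)
    }
    missing = [i for i in range(1, expected + 1) if i not in found]
    if missing:
        raise ValueError(f"judge reply is missing criteria: {missing}")
    return [found[i] for i in range(1, expected + 1)]
```

Per-criterion booleans are easier to compare with human ratings than a single 1 to 10 score, and a failed parse is better treated as an error than silently counted as a fail.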
Which metrics to measure
Match metrics to the application type: a RAG chatbot needs different ones than an agent with tools. The most important:
| Metric | What it measures | For which application |
|---|---|---|
| Faithfulness | Whether the answer follows from the context, no fabrication | RAG, document Q&A |
| Answer relevancy | Whether the answer actually addresses the question | Any Q&A application |
| Context precision/recall | Whether retrieval fetched the right chunks | RAG (retrieval layer) |
| Hallucination | Whether the model fabricated facts outside the context | RAG, summaries, facts |
| Task completion | Whether the agent accomplished the user's task | Agents with tools |
| Tool correctness | Whether the agent chose and called the right tool | Agents, function calling |
| Toxicity / bias | Whether the response is safe and neutral | Public apps, customer support |
For RAG applications (most business deployments) the gold standard is a triad: faithfulness (does the answer avoid fabrication), answer relevancy (does it answer the question) and context precision (does retrieval work). Together these metrics separate two different error sources, poor retrieval (a bad search layer) and poor generation (the model got good context but used it badly), which is crucial because each is fixed differently. I covered the retrieval layer in the article on advanced RAG.
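How the triad points at an error source can be sketched as a small decision function. The function name `diagnose`, the labels and the shared 0.7 threshold are illustrative assumptions; a real setup would tune thresholds per metric.

```python
def diagnose(
    faithfulness: float,
    answer_relevancy: float,
    context_precision: float,
    threshold: float = 0.7,
) -> str:
    """Map the three RAG scores to the layer that most likely needs fixing."""
    retrieval_ok = context_precision >= threshold
    generation_ok = faithfulness >= threshold and answer_relevancy >= threshold

    if retrieval_ok and generation_ok:
        return "ok"
    if retrieval_ok:
        # Good chunks were retrieved, but the model used them badly.
        return "generation"
    if generation_ok:
        # The answer scored well despite weak retrieval; the search layer is fragile.
        return "retrieval"
    # Both are low: bad context can cause bad answers, so fix retrieval first.
    return "retrieval_first"
```

Running this per case and counting the labels across the golden dataset gives a rough picture of where to spend effort: on the search layer or on the prompt and model.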
LLM-as-a-judge pitfalls — judge biases
The LLM judge is not objective. It has documented, systematic tendencies that will skew results if you do not know about them and counteract them:
- Position bias: in pairwise scoring the judge favors the response in the first (or last) position regardless of content; to counter it, score each pair in both orders and average
- Verbosity bias: the judge prefers longer, wordier responses, mistaking length for quality; to counter it, explicitly reward conciseness and penalize fluff in the rubric
- Self-preference bias: the judge favors responses generated by the same model as itself; to counter it, use a different model for judging than for generation
- Sycophancy: the judge agrees with suggestions embedded in the prompt ("is this great answer good?"); to counter it, write rubrics neutrally, without hinting at the expected verdict
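The position-bias countermeasure (score each pair in both orders and average) can be sketched as below. The `PairwiseJudge` interface, which returns `"first"` or `"second"`, is a hypothetical wrapper around the judging model.

```python
from typing import Callable

# Hypothetical pairwise judge:
# (question, first_answer, second_answer) -> "first" or "second"
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(
    judge: PairwiseJudge, question: str, answer_a: str, answer_b: str
) -> float:
    """Share of the two orderings in which answer A was preferred.

    1.0 = A wins in both orders, 0.0 = B wins in both orders,
    0.5 = the verdict flipped with the order, so treat it as a tie.
    """
    verdict_ab = judge(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge(question, answer_b, answer_a)  # A shown second
    for verdict in (verdict_ab, verdict_ba):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    wins_for_a = int(verdict_ab == "first") + int(verdict_ba == "second")
    return wins_for_a / 2
```

A judge that always picks the first position ends up at 0.5 for every pair, so its bias cancels out instead of being counted as a win for whichever version happened to be listed first.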
The most important safeguard: calibrating the judge against a human. Before you trust automatic scores, have a human manually rate 50–100 examples, run the LLM judge on them and compute agreement. Aim for 85–90%. If agreement is lower, refine the rubric and repeat. An uncalibrated judge is a number generator you cannot trust; a calibrated judge is a reliable, cheap reviewer running 24/7.
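The calibration step comes down to simple arithmetic over paired labels. A minimal sketch, assuming human and judge verdicts are stored as booleans in the same order; the function names and the 0.85 default target (the lower end of the range above) are illustrative.

```python
from typing import List, Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of examples on which the judge matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("need at least one labelled example")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def disagreements(
    case_ids: Sequence[str], human: Sequence[bool], judge: Sequence[bool]
) -> List[str]:
    """IDs to read manually when refining the rubric."""
    return [cid for cid, h, j in zip(case_ids, human, judge) if h != j]


def is_calibrated(rate: float, target: float = 0.85) -> bool:
    """True when agreement reaches the target."""
    return rate >= target
```

The list of disagreements is usually more useful than the percentage itself: reading those cases shows which rubric criterion the judge and the human interpret differently.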
Evaluation in code — a DeepEval example
The shortest path to a working eval is DeepEval — you write it like a pytest test. Below is an evaluation of a RAG response on three metrics at once:
```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
)

# A golden-dataset example: question + retrieval context + app response
test_case = LLMTestCase(
    input="What is the notice period in the contract?",
    actual_output="The notice period is 3 months.",
    expected_output="3 months",
    retrieval_context=[
        "Sec. 8: The agreement may be terminated with "
        "a three-month notice period."
    ],
)

# The LLM judge scores each metric; threshold 0.7 = quality gate
metrics = [
    FaithfulnessMetric(threshold=0.7, model="gpt-4o"),
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    ContextualPrecisionMetric(threshold=0.7, model="gpt-4o"),
]

results = evaluate(test_cases=[test_case], metrics=metrics)
```
Three things that make a difference here:
- **model="gpt-4o" is the judge, not the app**: you use a strong model different from the one generating responses (avoiding self-preference bias)
- **threshold=0.7 is the quality gate**: below it the test fails; this is the threshold that blocks a merge in CI when a change degrades quality
- **retrieval_context separates retrieval from generation**: faithfulness checks whether the answer follows from the context; context precision checks whether retrieval supplied the right chunk; you separate two error sources
In production you do not run a single case but the whole golden dataset (200–500 rows), and you aggregate the result: "92% passed the faithfulness threshold". That percentage is the number you compare between versions.
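Aggregation is independent of the tool that produced the scores. The sketch below assumes you already have one metric score per golden-dataset row; `pass_rate` and `wilson_interval` are illustrative helpers, and the Wilson score interval is my addition as one standard way to see how much of a difference between two versions could be sampling noise.

```python
import math
from typing import Sequence, Tuple


def pass_rate(scores: Sequence[float], threshold: float = 0.7) -> float:
    """Fraction of cases whose score meets the quality-gate threshold."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(s >= threshold for s in scores) / len(scores)


def wilson_interval(rate: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval (z = 1.96) for a pass rate measured on n cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (rate + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom
    )
    return centre - half_width, centre + half_width
```

With a few hundred cases the interval is a few percentage points wide, which is one reason a golden set of 200–500 rows is a sensible starting size: differences smaller than the interval should not be read as real improvements.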
Evals in CI/CD — quality gates
The full value of evals appears when you wire them into the pipeline like tests. Mature production evaluation in 2026 has four stages with automated quality gates:
1. Local development: the developer iterates on the prompt, running DeepEval or Promptfoo on the golden set like unit tests; a feedback loop in seconds
2. PR / merge (CI): on every pull request an automated eval runs on the full golden set; if quality drops below the threshold, the gate blocks the merge, exactly like a failing test
3. Staging: a regression eval compares the new version with the previous one; it catches silent regressions on known cases before they reach users
4. Production: an online eval on a sample of real traffic; the judge scores a random percentage of live responses, and an alert fires on a quality drop (this ties into AI monitoring, covered in a separate article)
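The gate logic behind stages 2 to 4 is small enough to sketch without any framework. All three helpers below are hypothetical: `gate_exit_code` for the CI gate, `regressions` for the staging comparison and `should_sample` for picking production traffic to score.

```python
import hashlib
from typing import Dict, List


def gate_exit_code(pass_rate: float, threshold: float) -> int:
    """CI gate: 0 lets the merge proceed, 1 fails the job."""
    return 0 if pass_rate >= threshold else 1


def regressions(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Case IDs that passed on the previous version and fail on the new one.

    A case missing from the new run is counted as a regression.
    """
    return sorted(
        cid
        for cid, passed in previous.items()
        if passed and not current.get(cid, False)
    )


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for online eval."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < rate * 10_000
```

In a CI script the gate would typically end with `sys.exit(gate_exit_code(rate, threshold))`. Hash-based sampling keeps the decision stable per request ID, so the same response is not scored on one replay and skipped on another.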
The tool choice depends on the need — and in practice mature teams combine two of the three open-source ones:
- DeepEval — when you want LLM unit tests integrated with the pipeline (pytest-style, 14+ metrics, custom plain-English rubrics); the default choice for CI/CD
- Ragas — when you deeply evaluate RAG; research-backed retrieval and generation metrics, the most cited in academic papers; often added to DeepEval
- Promptfoo — when you need red teaming and security validation (prompt injection!) alongside prompt evaluation; YAML config, no code
- Braintrust — when you want one platform connecting the whole lifecycle: dataset, scoring, production monitoring and CI gates in one place (a commercial SaaS)
The rule: start with DeepEval (or Promptfoo if you prefer YAML), add Ragas when RAG needs deeper analysis, and reach for a platform like Braintrust when the team grows and you want everything in one tool.
LLM evaluation deployment checklist
1. Build a golden dataset from real production failures: 200–500 diverse examples, not synthetic ones
2. Version the golden set in the repository and review changes via pull requests
3. Match metrics to the application type; for RAG use the triad: faithfulness, answer relevancy, context precision
4. Write judge rubrics specifically: break the score into clear, checkable criteria
5. Use a different (strong) model for judging than for generation, to avoid self-preference bias
6. Calibrate the judge: 50–100 manual ratings, compute agreement, aim for 85–90%
7. Counteract biases: pairwise scoring in both orders, reward conciseness in the rubric, neutral prompts
8. Set thresholds (quality gates) on metrics and wire the eval into CI; let it block the merge on a quality drop
9. Add a regression eval in staging: compare the new version with the previous one on the golden set
10. Deploy an online eval in production: sample real traffic, alert on quality drops
11. Add every new production failure to the golden set, so the same bug never returns unnoticed
12. Combine tools deliberately: DeepEval/Promptfoo for the pipeline, Ragas for RAG, Braintrust when you want one platform
Key takeaways
Without evaluation you develop an AI application blind: every prompt or model change is a spin of the roulette wheel. Evals turn "seems better" into a number and separate a professional AI team from an amateur one. The foundation is a golden dataset from real failures (200–500 diverse examples), not synthetic ones. The most effective scoring method is LLM-as-a-judge, with ~85% agreement with humans at a cost 500–5,000× lower, but only after calibration and with bias counteraction (position, verbosity, self-preference, sycophancy). Match metrics to the application (for RAG: faithfulness, answer relevancy, context precision), wire evals into CI as quality gates and add every failure to the golden set. Tools: DeepEval for the pipeline, Ragas for RAG, Promptfoo for security, Braintrust when you want one platform.
