
LLM Evaluation (Evals) — How to Measure AI Application Quality with LLM-as-a-Judge

LLM evaluation (evals) is the systematic measurement of an AI application's response quality across a set of representative examples — the equivalent of automated tests, but for outputs that have no single correct answer. The most effective method in 2026 is LLM-as-a-judge: a second language model scores your application's responses against a defined rubric, agreeing with human judgement ~85% of the time (more often than two humans agree with each other) and costing 500–5,000× less than manual review. If you cannot answer the question "is the new prompt better than the old one" with a number, you have no evals and you are changing your AI application blind. The foundation is a golden dataset built from real production failures, not examples invented at a desk.

The complete guide to evaluating LLM applications: how evals differ from tests, how to build a golden dataset from real failures, how LLM-as-a-judge works and why it agrees with humans ~85% of the time, which metrics to measure (faithfulness, hallucination, answer relevancy), how to avoid judge biases, how to wire evals into CI/CD, and which tool to choose — DeepEval, Ragas, Promptfoo or Braintrust.

You change one sentence in the system prompt because the chatbot answered one customer badly. You ship the fix. A week later it turns out that the change broke responses in three other scenarios nobody checked. Sound familiar? That is daily life for teams developing AI applications without evaluation: every change is a spin of the roulette wheel, because nobody measures what actually improved and what broke.

Evaluation turns that roulette into engineering. Instead of "seems better" you get a number: 87% of responses meet the criteria, up from 81%. That is what separates a professional AI team from an amateur one — and it is most often what separates a deployment that works from a demo that fell apart in production. This article shows how to build evals from scratch: golden dataset, metrics, LLM-as-a-judge, judge biases, CI/CD and tool selection.

Evals vs tests — what's the difference?

This is the first source of confusion. In the article on testing AI applications I described tests in the software-engineering sense: does the code work, does the API return the right format, does the pipeline run without crashing. Evals are something else: they measure the quality of the model's response, which is inherently fuzzy.

| Aspect | Classic tests | Evals (LLM evaluation) |
| --- | --- | --- |
| What they check | Whether the code works correctly | Whether the model's response is good |
| Result | Pass / fail (binary) | A score on a scale or rubric |
| Determinism | Same input = same result | Same input = different outputs |
| Correct answer | One, known in advance | Many acceptable variants |
| Example | assert format == JSON | "Is the answer grounded in the context?" |
| Tools | pytest, jest | DeepEval, Ragas, Promptfoo |

The key difference: in a classic test "2 + 2" always equals "4". In an LLM application the prompt "summarize this contract" has dozens of good answers and infinitely many bad ones — and you cannot check that with a simple string comparison. So you need a qualitative judgement that can be automated and repeated. That is what evals do.

Both approaches are needed and complement each other: tests ensure the application works at all, evals ensure it works well. Without tests the app crashes; without evals it silently degrades in quality with every prompt or model change.

Golden dataset — the foundation of every evaluation

An evaluation is only as good as the set of examples you run it on. A golden dataset is a collection of representative input cases (often with expected answers or criteria) on which you measure quality with every change. Four rules decide its worth:

  • Build it from real failures, not invented examples — the most valuable cases are the ones the app has already failed on in production; every reported bug is a new row in the golden set, so the same failure never returns unnoticed
  • Size of 200–500 examples to start — enough for the result to be statistically meaningful, few enough for the eval to be fast and cheap; grow it as you discover new edge cases
  • Cover diversity, not just volume — typical, edge, hard and adversarial cases (prompt injection), in different languages and registers; 500 variants of the same question are worth less than 50 genuinely different ones
  • Version it like code — keep the golden dataset in the repository, review changes via pull requests; it is a living artifact that grows with the application
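The four rules above can be sketched as a small data structure. This is a minimal illustration, not a prescribed schema: the `GoldenCase` fields, the category names and the JSONL layout are all my assumptions, chosen so that one case sits on one line and pull-request diffs stay readable.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class GoldenCase:
    """One row of the golden dataset (all field names are hypothetical)."""
    case_id: str
    input: str
    expected_output: str = ""    # optional reference answer or criteria
    category: str = "typical"    # typical | edge | hard | adversarial
    source: str = "production"   # production | synthetic
    tags: list[str] = field(default_factory=list)


def dump_jsonl(cases: list[GoldenCase]) -> str:
    """Serialize one case per line so repository diffs are easy to review."""
    return "\n".join(
        json.dumps(asdict(case), ensure_ascii=False, sort_keys=True)
        for case in cases
    )


def load_jsonl(text: str) -> list[GoldenCase]:
    """Parse the JSONL text back into GoldenCase objects, skipping blank lines."""
    return [GoldenCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


def add_failure(cases: list[GoldenCase], new_case: GoldenCase) -> list[GoldenCase]:
    """Append a reported failure unless a case with the same id already exists."""
    if any(case.case_id == new_case.case_id for case in cases):
        return cases
    return cases + [new_case]
```

The `category` and `source` fields make it cheap to check coverage later, for example how many rows are adversarial and how many came from real traffic rather than synthetic generation.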

The most common beginner mistake: generating the golden set synthetically with the model itself. Such a set measures how well the model agrees with itself, not how well it handles real, unpredictable user queries. Synthetic examples are fine as a supplement, but the core must come from real traffic.

LLM-as-a-judge — how a model evaluates a model

/// LLM QUALITY EVALUATION PIPELINE

From production logs to a quality alert

  1. LLM calls: each one logged
  2. Sampling 5%: random sample
  3. LLM-judge: GPT-4o mini scores
  4. Aggregation: average, trends
  5. Alert / Dashboard: if below threshold

Offline, not real-time. Evaluation runs on a sample of logs (a nightly or hourly job). For 10,000 calls/day: 500 evaluations × ~$0.002 = $1/day for full quality monitoring.

Key figures: 0.85+ correlation with human rating; ~$0.002 cost per evaluation; a 5% sample stands in for full coverage.
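The sampling step and the cost arithmetic from the pipeline summary above can be sketched in a few lines. The function names are hypothetical, and the 5% rate and ~$0.002 per evaluation are simply the figures quoted in this article, not measured prices.

```python
import random


def sample_calls(call_ids: list, rate: float = 0.05, seed=None) -> list:
    """Pick a uniform random sample of logged calls for the judge to score."""
    rng = random.Random(seed)
    sample_size = round(len(call_ids) * rate)
    return rng.sample(call_ids, sample_size)


def daily_judge_cost(calls_per_day: int, rate: float = 0.05,
                     cost_per_eval: float = 0.002) -> float:
    """Estimated daily judge spend: sampled calls times cost per evaluation."""
    return round(calls_per_day * rate) * cost_per_eval
```

With 10,000 calls per day this selects 500 calls and estimates about $1 of judge cost, matching the arithmetic in the diagram.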

Since responses cannot be checked with a string comparison, and manual review does not scale to thousands of examples on every change — who should judge? The 2026 answer: a second language model as the judge. LLM-as-a-judge is a technique where a strong model (e.g. GPT-4o, Claude) receives your application's response along with a scoring rubric and returns a verdict — a numeric score, a label or a comparison.

Why does it work? Because judging is easier than generating. It is much simpler for a model to state "is this answer grounded in the given context" than to generate the perfect answer itself. The data confirms the effectiveness: an LLM judge agrees with human reviewers ~85% of the time — more than two humans agree on the same task — at a cost 500–5,000× lower.

There are three main scoring modes:

  • Pointwise — the judge scores one response at a time against criteria (e.g. "rate relevance on a 1–5 scale"); the simplest and most common
  • Pairwise — the judge compares two responses and picks the better one; ideal for comparing two prompt versions or two models, because relative judgement is more stable than absolute
  • Reference-based — the judge compares the response with a reference; used when the golden set contains expected answers

The most important rule: the judge's rubric is a product you have to refine. The more specific the criteria, the higher the agreement with humans. "Rate the quality" gives random results; "Rate whether the answer (1) addresses the question, (2) is grounded solely in the given context, (3) contains no fabricated facts — return Yes/No for each" gives a repeatable, sensible score.
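A specific rubric like the one above translates naturally into a prompt builder plus a strict parser for the judge's reply. The sketch below is only an illustration: the criterion keys, the prompt wording and the JSON reply format are my assumptions, and the actual call to the judge model is deliberately left out.

```python
import json

# Hypothetical criterion keys mapped to the three checks from the rubric.
CRITERIA = {
    "addresses_question": "The answer addresses the question.",
    "grounded_in_context": "The answer is grounded solely in the given context.",
    "no_fabricated_facts": "The answer contains no fabricated facts.",
}


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Build a neutral pointwise prompt that asks for Yes/No per criterion."""
    parts = ["Grade the answer against each criterion. Reply Yes or No for each.", "Criteria:"]
    parts += [f'- "{key}": {text}' for key, text in CRITERIA.items()]
    parts += [
        f"Question: {question}",
        f"Context: {context}",
        f"Answer: {answer}",
        'Reply with a JSON object mapping each criterion key to "Yes" or "No".',
    ]
    return "\n".join(parts)


def parse_verdict(raw_reply: str) -> dict:
    """Turn the judge's JSON reply into booleans; reject anything ambiguous."""
    data = json.loads(raw_reply)
    verdict = {}
    for key in CRITERIA:
        value = str(data.get(key, "")).strip().lower()
        if value not in ("yes", "no"):
            raise ValueError(f"criterion {key!r} needs Yes or No, got {data.get(key)!r}")
        verdict[key] = value == "yes"
    return verdict


def passes(verdict: dict) -> bool:
    """A response passes only when every criterion is satisfied."""
    return all(verdict.values())
```

Rejecting anything other than Yes or No is a design choice: a judge reply that cannot be parsed should surface as an error rather than silently count as a pass or a fail.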

Which metrics to measure

You match metrics to the application type. Different ones matter for a RAG chatbot than for an agent with tools. The most important:

| Metric | What it measures | For which application |
| --- | --- | --- |
| Faithfulness | Whether the answer follows from the context, no fabrication | RAG, document Q&A |
| Answer relevancy | Whether the answer actually addresses the question | Any Q&A application |
| Context precision/recall | Whether retrieval fetched the right chunks | RAG (retrieval layer) |
| Hallucination | Whether the model fabricated facts outside the context | RAG, summaries, facts |
| Task completion | Whether the agent accomplished the user's task | Agents with tools |
| Tool correctness | Whether the agent chose and called the right tool | Agents, function calling |
| Toxicity / bias | Whether the response is safe and neutral | Public apps, customer support |

For RAG applications (most business deployments) the gold standard is a triad: faithfulness (does it not fabricate), answer relevancy (does it answer the question) and context precision (does retrieval work). These three metrics separate two different error sources — poor retrieval (a bad search layer) from poor generation (the model got good context but used it badly) — which is crucial, because each is fixed differently. I covered the retrieval layer in the article on advanced RAG.
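To make the "two error sources" idea concrete, here is a small triage helper. It is a heuristic of my own for illustration, not part of any tool: the function name, the order of the checks and the 0.7 threshold are assumptions.

```python
def diagnose(context_precision: float, faithfulness: float,
             answer_relevancy: float, threshold: float = 0.7) -> str:
    """Point at the layer most likely responsible for a bad RAG response."""
    if context_precision < threshold:
        # The right chunks never reached the model, so fix the search layer first.
        return "retrieval"
    if faithfulness < threshold or answer_relevancy < threshold:
        # The context was good, but the model used it badly.
        return "generation"
    return "ok"
```

Retrieval is checked first because a generation score measured on bad context says little about the prompt or the model.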

LLM-as-a-judge pitfalls — judge biases

The LLM judge is not objective. It has documented, systematic tendencies that will skew results if you do not know about them and counteract them:

  • Position bias: in pairwise scoring the judge favors the response in the first (or last) position regardless of content; counter: score each pair in both orders and average
  • Verbosity bias: the judge prefers longer, wordier responses, mistaking length for quality; counter: explicitly reward conciseness and penalize fluff in the rubric
  • Self-preference bias: the judge favors responses generated by the same model as itself; counter: use a different model for judging than for generation
  • Sycophancy: the judge agrees with suggestions embedded in the prompt ("is this great answer good?"); counter: write rubrics neutrally, without hinting at the expected verdict
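The position-bias countermeasure from the list above (score each pair in both orders and average) can be sketched as follows. The `judge` callable is a hypothetical stand-in for your real judge call; I assume it returns how strongly it prefers the first answer shown, as a number between 0 and 1.

```python
from typing import Callable

# Hypothetical signature: judge(question, first, second) -> preference for `first` in [0, 1].
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(judge: PairwiseJudge, question: str,
                        answer_a: str, answer_b: str) -> float:
    """Preference for answer_a in [0, 1], averaged over both presentation orders."""
    a_shown_first = judge(question, answer_a, answer_b)
    # When B is shown first, the judge reports its preference for B; flip it for A.
    a_shown_second = 1.0 - judge(question, answer_b, answer_a)
    return (a_shown_first + a_shown_second) / 2
```

A judge that always prefers whatever comes first, regardless of content, ends up at 0.5 after averaging, which is the honest answer for a judge that carries no real signal.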

The most important safeguard: calibrating the judge against a human. Before you trust automatic scores, have a human manually rate 50–100 examples, run the LLM judge on them and compute agreement. Aim for 85–90%. If agreement is lower, refine the rubric and repeat. An uncalibrated judge is a number generator you cannot trust; a calibrated judge is a reliable, cheap reviewer running 24/7.
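The calibration step boils down to comparing two lists of labels. Below is a minimal sketch, assuming the human and the judge rated the same examples in the same order; the function names are hypothetical, and the 0.85 target is the lower end of the 85–90% range mentioned above.

```python
def agreement_rate(human: list, judge: list) -> float:
    """Fraction of examples where the judge gave the same label as the human."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def disagreements(human: list, judge: list) -> list:
    """Indices of examples to read by hand when refining the rubric."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]


def is_calibrated(human: list, judge: list, target: float = 0.85) -> bool:
    """True when agreement reaches the target you set for trusting the judge."""
    return agreement_rate(human, judge) >= target
```

The list of disagreements is usually more useful than the rate itself, because reading those examples shows which rubric criterion is ambiguous.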

Evaluation in code — a DeepEval example

The shortest path to a working eval is DeepEval — you write it like a pytest test. Below is an evaluation of a RAG response on three metrics at once:

eval_rag.py

    from deepeval import evaluate
    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import (
        FaithfulnessMetric,
        AnswerRelevancyMetric,
        ContextualPrecisionMetric,
    )

    # A golden-dataset example: question + retrieval context + app response
    test_case = LLMTestCase(
        input="What is the notice period in the contract?",
        actual_output="The notice period is 3 months.",
        expected_output="3 months",
        retrieval_context=[
            "Sec. 8: The agreement may be terminated with "
            "a three-month notice period."
        ],
    )

    # The LLM judge scores each metric; threshold 0.7 = quality gate
    metrics = [
        FaithfulnessMetric(threshold=0.7, model="gpt-4o"),
        AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
        ContextualPrecisionMetric(threshold=0.7, model="gpt-4o"),
    ]

    results = evaluate(test_cases=[test_case], metrics=metrics)

Three things that make a difference here:

  • model="gpt-4o" is the judge, not the app: you use a strong model different from the one generating responses (avoiding self-preference bias)
  • threshold=0.7 is the quality gate: below it the test fails; this is the threshold that blocks a merge in CI when a change degrades quality
  • retrieval_context separates retrieval from generation: faithfulness checks whether the answer follows from the context; context precision checks whether retrieval supplied the right chunk; you separate two error sources

In practice you do not run a single case but the whole golden dataset (200–500 rows), and you aggregate the result: "92% passed the faithfulness threshold". That percentage is the number you compare between versions.
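When you compare that percentage between versions, it helps to know how much of the difference could be noise. The sketch below aggregates per-case scores into a pass rate and estimates a margin of error with the standard normal approximation for a proportion. The function names are hypothetical, and the approximation assumes independent cases and a rate that is not too close to 0 or 1.

```python
import math


def pass_rate(scores: list, threshold: float = 0.7) -> float:
    """Share of golden-set cases whose metric score reaches the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


def margin_of_error(rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a pass rate."""
    return z * math.sqrt(rate * (1 - rate) / n)
```

For a pass rate of 87%, the margin is roughly 4.7 percentage points on 200 cases and roughly 2.9 points on 500, so a change smaller than that may simply be noise. This is one practical reason to grow the golden set over time.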

Evals in CI/CD — quality gates

/// DEEPEVAL vs RAGAS vs PROMPTFOO vs BRAINTRUST: WHICH TOOL?

| Tool | Category | Specialty | Metrics / scope | Other | CI/CD | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| DeepEval | Pipeline / CI | LLM unit tests | 14+ built-in | Custom judge: plain-English rubric | Pytest-style | Pipeline integration |
| Ragas | RAG | RAG evaluation | Faithfulness, recall | Heritage: research-backed | Via integrations | RAG pipelines |
| Promptfoo | Red team | Red teaming, security | Comparisons, assertions | Config: YAML, no code | Native | Security + prompts |
| Braintrust | Platform | Full eval lifecycle | Dataset → prod → CI | Hosting: SaaS | Built-in gates | Teams, one place |

Mature teams combine 2 of the 3 open-source tools (DeepEval, Ragas, Promptfoo); 1 tool is a full-lifecycle platform (Braintrust).

The full value of evals appears when you wire them into the pipeline like tests. Mature production evaluation in 2026 is four stages with automated quality gates:

  1. Local development: the developer iterates on the prompt, running DeepEval or Promptfoo on the golden set like unit tests; a feedback loop in seconds
  2. PR / merge (CI): on every pull request an automated eval runs on the full golden set; if quality drops below the threshold, the gate blocks the merge, exactly like a failing test
  3. Staging: a regression eval compares the new version with the previous one; it catches silent regressions on known cases before they reach users
  4. Production: an online eval on a sample of real traffic; the judge scores a random percentage of live responses, and an alert fires on a quality drop (this connects to the AI monitoring from a separate article)
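The CI and staging gates above reduce to two checks per metric: an absolute threshold and a comparison with the previous version. Here is a minimal sketch of such a gate; the function names, the metric dictionaries and the 0.02 allowed drop are all assumptions, and producing the pass rates is left to your eval tool.

```python
def gate(current: dict, baseline: dict, thresholds: dict,
         max_drop: float = 0.02) -> list:
    """Return failure messages; an empty list means the quality gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from eval results")
            continue
        if value < minimum:
            failures.append(f"{metric}: {value:.2f} is below the gate {minimum:.2f}")
        previous = baseline.get(metric)
        if previous is not None and previous - value > max_drop:
            failures.append(f"{metric}: dropped {previous - value:.2f} vs baseline")
    return failures


def main(current: dict, baseline: dict, thresholds: dict) -> int:
    """Print failures and return an exit code (non-zero blocks the merge)."""
    failures = gate(current, baseline, thresholds)
    for message in failures:
        print("EVAL GATE FAILED:", message)
    return 1 if failures else 0
```

In a CI job you would pass the return value of `main(...)` to `sys.exit`, so that a failed gate fails the job the same way a failing test does.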

The tool choice depends on the need — and in practice mature teams combine two of the three open-source ones:

  • DeepEval — when you want LLM unit tests integrated with the pipeline (pytest-style, 14+ metrics, custom plain-English rubrics); the default choice for CI/CD
  • Ragas — when you deeply evaluate RAG; research-backed retrieval and generation metrics, the most cited in academic papers; often added to DeepEval
  • Promptfoo — when you need red teaming and security validation (prompt injection!) alongside prompt evaluation; YAML config, no code
  • Braintrust — when you want one platform connecting the whole lifecycle: dataset, scoring, production monitoring and CI gates in one place (a commercial SaaS)

The rule: start with DeepEval (or Promptfoo if you prefer YAML), add Ragas when RAG needs deeper analysis, and reach for a platform like Braintrust when the team grows and you want everything in one tool.

LLM evaluation deployment checklist

  1. Build a golden dataset from real production failures: 200–500 diverse examples, not synthetic ones
  2. Version the golden set in the repository and review changes via pull requests
  3. Match metrics to the application type; for RAG the triad: faithfulness, answer relevancy, context precision
  4. Write judge rubrics specifically: break the score into clear, checkable criteria
  5. Use a different (strong) model for judging than for generation, avoiding self-preference bias
  6. Calibrate the judge: 50–100 manual ratings, compute agreement, aim for 85–90%
  7. Counteract biases: pairwise scoring in both orders, reward conciseness in the rubric, neutral prompts
  8. Set thresholds (quality gates) on metrics and wire the eval into CI; let it block the merge on a quality drop
  9. Add a regression eval in staging: compare the new version with the previous one on the golden set
  10. Deploy an online eval in production: sample real traffic, alert on quality drops
  11. Add every new production failure to the golden set, so the same bug never returns unnoticed
  12. Combine tools deliberately: DeepEval/Promptfoo for the pipeline, Ragas for RAG, Braintrust when you want one platform

Key takeaways

Without evaluation you develop an AI application blind: every prompt or model change is a spin of the roulette wheel. Evals turn "seems better" into a number and separate a professional AI team from an amateur one. The foundation is a golden dataset from real failures (200–500 diverse examples), not synthetic ones. The most effective scoring method is LLM-as-a-judge, with ~85% agreement with humans at a cost 500–5,000× lower, but only after calibration and with bias counteraction (position, verbosity, self-preference, sycophancy). Match metrics to the application (for RAG: faithfulness, answer relevancy, context precision), wire evals into CI as quality gates and add every failure to the golden set. Tools: DeepEval for the pipeline, Ragas for RAG, Promptfoo for security, Braintrust when you want one platform.

---

I help companies build evaluation systems for AI applications — from the golden dataset and metric selection, through LLM-as-a-judge calibration and bias counteraction, to wiring evals into CI/CD and production monitoring. Get in touch — I start with a free 30-minute analysis of your use case.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.
