How is LLM evaluation (evals) different from regular tests?

Classic tests check whether the code works correctly — they give a pass/fail result and assume one known correct answer (2+2=4). Evals measure the quality of the model's response, which is inherently fuzzy: the prompt "summarize this contract" has dozens of good answers and infinitely many bad ones, so they cannot be checked with a string comparison. Evals give a score on a scale or rubric, handle non-determinism (the same input gives different outputs) and judge quality, not technical correctness. Both are needed: tests ensure the application works, evals ensure it works well.

Can I trust an LLM that judges another LLM?

Yes, provided it is calibrated. Data shows an LLM judge agrees with human reviewers ~85% of the time — more than two humans agree on the same task — at a cost 500–5,000× lower than manual review. It works because judging is easier than generating. But trust requires two things: a specific rubric (broken into checkable criteria, not "rate the quality") and calibration — manually rate 50–100 examples, run the judge on them and compute agreement, aiming for 85–90%. An uncalibrated judge is a number generator you cannot trust; a calibrated judge is a reliable 24/7 reviewer.

How many examples do I need in the golden dataset?

To start, 200–500 examples — enough for the result to be statistically meaningful and few enough for the eval to be fast and cheap. More important than the count is diversity: typical, edge, hard and adversarial cases (prompt injection), in different languages and registers. 500 variants of the same question are worth less than 50 genuinely different ones. The key: build the set from real production failures, not from examples invented at a desk or generated by the model itself — a synthetic golden set only measures how well the model agrees with itself. Grow it over time: every new reported bug is a new row.

What are the biggest LLM-as-a-judge pitfalls?

Four documented judge biases. Position bias — in pairwise scoring it favors the response in the first position (counter: score in both orders and average). Verbosity bias — it prefers longer responses, mistaking length for quality (fix: reward conciseness in the rubric). Self-preference bias — it favors responses from the same model it is (fix: judge with a different model than you generate with). Sycophancy — it agrees with suggestions in the prompt (fix: write rubrics neutrally). Above all stands calibration against a human — without it you do not know whether your judge even measures what you think.

Which evaluation tool should I choose — DeepEval, Ragas, Promptfoo or Braintrust?

It depends on the need, and mature teams combine two of the three open-source ones. DeepEval is the default for CI/CD — pytest-style LLM unit tests, 14+ metrics, custom plain-English rubrics. You add Ragas when you deeply evaluate RAG — research-backed retrieval and generation metrics. You choose Promptfoo when you need red teaming and security validation alongside prompt evaluation (YAML config, no code). Braintrust is a commercial platform connecting the whole lifecycle — dataset, scoring, production monitoring and CI gates in one place — sensible as the team grows. Practically: start with DeepEval, add Ragas for RAG, reach for a platform later.

How often should I run evals?

At four levels with different frequencies. Locally — on every prompt iteration, as a fast feedback loop in seconds (a small subset of the golden set). In CI — on every pull request against the full golden set, as a gate blocking the merge on a quality drop. In staging — before every deployment, as a regression eval comparing the new version with the previous one. In production — continuously, on a sample of real traffic (online eval), with alerts on quality drops. The same discipline as tests: the earlier in the pipeline you catch a regression, the cheaper it is to fix. Crucially, the CI eval must be automated — a manually run eval quickly stops being run.

RETURN_TO_BLOG

2026-06-15AI & Automation 15 min

LLM Evaluation (Evals) — How to Measure AI Application Quality with LLM-as-a-Judge

Q: Which metrics should I measure for a RAG application?

The gold standard for RAG is a triad: faithfulness (does the answer follow from the given context, without fabrication), answer relevancy (does it actually address the question) and context precision (did retrieval fetch the right chunks). This trio is valuable because it separates two different error sources: poor retrieval (a bad search layer) from poor generation (the model got good context but used it badly). That distinction is crucial because each problem is fixed differently — you improve retrieval with chunking and reranking, generation with the prompt and model. Additionally, measure hallucination for fact-based applications.

Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

LLM evaluation (evals) is the systematic measurement of an AI application's response quality across a set of representative examples — the equivalent of automated tests, but for outputs that have no single correct answer. The most effective method in 2026 is LLM-as-a-judge: a second language model scores your application's responses against a defined rubric, agreeing with human judgement ~85% of the time (more often than two humans agree with each other) and costing 500–5,000× less than manual review. If you cannot answer the question "is the new prompt better than the old one" with a number, you have no evals and you are changing your AI application blind. The foundation is a golden dataset built from real production failures, not examples invented at a desk.

The complete guide to evaluating LLM applications: how evals differ from tests, how to build a golden dataset from real failures, how LLM-as-a-judge works and why it agrees with humans ~85% of the time, which metrics to measure (faithfulness, hallucination, answer relevancy), how to avoid judge biases, how to wire evals into CI/CD, and which tool to choose — DeepEval, Ragas, Promptfoo or Braintrust.

You change one sentence in the system prompt because the chatbot answered one customer badly. You ship the fix. A week later it turns out that change broke responses in three other scenarios nobody checked. Sound familiar? That is daily life for teams developing AI applications without evaluation — every change is a roulette, because nobody measures what actually improved and what broke.

Evaluation turns that roulette into engineering. Instead of "seems better" you get a number: 87% of responses meet the criteria, up from 81%. That is what separates a professional AI team from an amateur one — and it is most often what separates a deployment that works from a demo that fell apart in production. This article shows how to build evals from scratch: golden dataset, metrics, LLM-as-a-judge, judge biases, CI/CD and tool selection.

Evals vs tests — what's the difference?

This is the first source of confusion. In the article on testing AI applications I described tests in the software-engineering sense — does the code work, does the API return the right format, does the pipeline not crash. Evals are something else: they measure the quality of the model's response, which is inherently fuzzy.

Aspect	Classic tests	Evals (LLM evaluation)
What they check	Whether the code works correctly	Whether the model's response is good
Result	Pass / fail (binary)	A score on a scale or rubric
Determinism	Same input = same result	Same input = different outputs
Correct answer	One, known in advance	Many acceptable variants
Example	assert format == JSON	"Is the answer grounded in the context?"
Tools	pytest, jest	DeepEval, Ragas, Promptfoo

The key difference: in a classic test "2 + 2" always equals "4". In an LLM application the prompt "summarize this contract" has dozens of good answers and infinitely many bad ones — and you cannot check that with a simple string comparison. So you need a qualitative judgement that can be automated and repeated. That is what evals do.

Both approaches are needed and complement each other: tests ensure the application works at all, evals ensure it works well. Without tests the app crashes; without evals it silently degrades in quality with every prompt or model change.

Golden dataset — the foundation of every evaluation

An evaluation is only as good as the set of examples you run it on. A golden dataset is a collection of representative input cases (often with expected answers or criteria) on which you measure quality with every change. Four rules decide its worth:

Build it from real failures, not invented examples — the most valuable cases are the ones the app has already failed on in production; every reported bug is a new row in the golden set, so the same failure never returns unnoticed
Size of 200–500 examples to start — enough for the result to be statistically meaningful, few enough for the eval to be fast and cheap; grow it as you discover new edge cases
Cover diversity, not just volume — typical, edge, hard and adversarial cases (prompt injection), in different languages and registers; 500 variants of the same question are worth less than 50 genuinely different ones
Version it like code — keep the golden dataset in the repository, review changes via pull requests; it is a living artifact that grows with the application

The most common beginner mistake: generating the golden set synthetically with the model itself. Such a set measures how well the model agrees with itself, not how well it handles real, unpredictable user queries. Synthetic examples are fine as a supplement, but the core must come from real traffic.

LLM-as-a-judge — how a model evaluates a model

/// LLM QUALITY EVALUATION PIPELINE

From production logs to a quality alert

LLM calls

Each one logged

›

↓

Sampling 5%

Random sample

›

↓

LLM-judge

GPT-4o mini scores

›

↓

Aggregation

Average, trends

›

↓

Alert / Dashboard

If below threshold

★

Offline, not real-time. Evaluation runs on a sample of logs (nightly job or hourly). For 10,000 calls/day → 500 evaluations × ~$0.002 = $1/day for full quality monitoring.

0.85+

CORRELATION WITH HUMAN RATING

~$0.002

COST PER EVALUATION

SAMPLE = FULL COVERAGE

Since responses cannot be checked with a string comparison, and manual review does not scale to thousands of examples on every change — who should judge? The 2026 answer: a second language model as the judge. LLM-as-a-judge is a technique where a strong model (e.g. GPT-4o, Claude) receives your application's response along with a scoring rubric and returns a verdict — a numeric score, a label or a comparison.

Why does it work? Because judging is easier than generating. It is much simpler for a model to state "is this answer grounded in the given context" than to generate the perfect answer itself. The data confirms the effectiveness: an LLM judge agrees with human reviewers ~85% of the time — more than two humans agree on the same task — at a cost 500–5,000× lower.

There are three main scoring modes:

Pointwise — the judge scores one response at a time against criteria (e.g. "rate relevance on a 1–5 scale"); the simplest and most common
Pairwise — the judge compares two responses and picks the better one; ideal for comparing two prompt versions or two models, because relative judgement is more stable than absolute
Reference-based — the judge compares the response with a reference; used when the golden set contains expected answers

The most important rule: the judge's rubric is a product you have to refine. The more specific the criteria, the higher the agreement with humans. "Rate the quality" gives random results; "Rate whether the answer (1) addresses the question, (2) is grounded solely in the given context, (3) contains no fabricated facts — return Yes/No for each" gives a repeatable, sensible score.

Which metrics to measure

You match metrics to the application type. Different ones matter for a RAG chatbot than for an agent with tools. The most important:

Metric	What it measures	For which application
Faithfulness	Whether the answer follows from the context, no fabrication	RAG, document Q&A
Answer relevancy	Whether the answer actually addresses the question	Any Q&A application
Context precision/recall	Whether retrieval fetched the right chunks	RAG (retrieval layer)
Hallucination	Whether the model fabricated facts outside the context	RAG, summaries, facts
Task completion	Whether the agent accomplished the user's task	Agents with tools
Tool correctness	Whether the agent chose and called the right tool	Agents, function calling
Toxicity / bias	Whether the response is safe and neutral	Public apps, customer support

For RAG applications (most business deployments) the gold standard is a triad: faithfulness (does it not fabricate), answer relevancy (does it answer the question) and context precision (does retrieval work). These three metrics separate two different error sources — poor retrieval (a bad search layer) from poor generation (the model got good context but used it badly) — which is crucial, because each is fixed differently. I covered the retrieval layer in the article on advanced RAG.

LLM-as-a-judge pitfalls — judge biases

The LLM judge is not objective. It has documented, systematic tendencies that will skew results if you do not know about them and counteract them:

Position bias — in pairwise scoring the judge favors the response in the first (or last) position regardless of content; counter: score each pair in both orders and average
Verbosity bias — the judge prefers longer, wordier responses, mistaking length for quality; counter: explicitly reward conciseness and penalize fluff in the rubric
Self-preference bias — the judge favors responses generated by the same model it is; counter: use a different model for judging than for generation
Sycophancy — the judge agrees with suggestions embedded in the prompt ("is this great answer good?"); counter: write rubrics neutrally, without hinting at the expected verdict

The most important safeguard: calibrating the judge against a human. Before you trust automatic scores, have a human manually rate 50–100 examples, run the LLM judge on them and compute agreement. Aim for 85–90%. If agreement is lower, refine the rubric and repeat. An uncalibrated judge is a number generator you cannot trust; a calibrated judge is a reliable, cheap reviewer running 24/7.

Evaluation in code — a DeepEval example

The shortest path to a working eval is DeepEval — you write it like a pytest test. Below is an evaluation of a RAG response on three metrics at once:

eval_rag.py

from deepeval import evaluatefrom deepeval.test_case import LLMTestCasefrom deepeval.metrics import (    FaithfulnessMetric,    AnswerRelevancyMetric,    ContextualPrecisionMetric,)# A golden-dataset example: question + retrieval context + app responsetest_case = LLMTestCase(    input="What is the notice period in the contract?",    actual_output="The notice period is 3 months.",    expected_output="3 months",    retrieval_context=[        "Sec. 8: The agreement may be terminated with "        "a three-month notice period."    ],)# The LLM judge scores each metric; threshold 0.7 = quality gatemetrics = [    FaithfulnessMetric(threshold=0.7, model="gpt-4o"),    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),    ContextualPrecisionMetric(threshold=0.7, model="gpt-4o"),]results = evaluate(test_cases=[test_case], metrics=metrics)Three things that make a difference here:- **model="gpt-4o" is the judge, not the app** — you use a strong model different from the one generating responses (avoiding self-preference bias)- **threshold=0.7 is the quality gate** — below it the test fails; this is the threshold that blocks a merge in CI when a change degrades quality- **retrieval_context separates retrieval from generation** — faithfulness checks whether the answer follows from the context; context precision checks whether retrieval supplied the right chunk; you separate two error sources

In production you do not run a single case but the whole golden dataset (200–500 rows), and you aggregate the result: "92% passed the faithfulness threshold". That percentage is the number you compare between versions.

Evals in CI/CD — quality gates

/// DEEPEVAL vs RAGAS vs PROMPTFOO vs BRAINTRUST — WHICH TOOL?

DeepEval

PIPELINE / CI

SpecialtyLLM unit tests

Metrics14+ built-in

Custom judgePlain-English rubric

CI/CDPytest-style

Best forPipeline integration

Ragas

RAG

SpecialtyRAG evaluation

MetricsFaithfulness, recall

HeritageResearch-backed

CI/CDVia integrations

Best forRAG pipelines

Promptfoo

RED TEAM

SpecialtyRed teaming, security

MetricsComparisons, assertions

ConfigYAML, no code

CI/CDNative

Best forSecurity + prompts

Braintrust

PLATFORM

SpecialtyFull eval lifecycle

ScopeDataset→prod→CI

HostingSaaS

CI/CDBuilt-in gates

Best forTeams, one place

2 of 3

TOOLS COMBINED BY MATURE TEAMS

open

SOURCE — DEEPEVAL RAGAS · PROMPTFOO

FULL-LIFECYCLE PLATFORM (BRAINTRUST)

The full value of evals appears when you wire them into the pipeline like tests. Mature production evaluation in 2026 is four stages with automated quality gates:

1.Local development — the developer iterates on the prompt, running DeepEval or Promptfoo on the golden set like unit tests; a feedback loop in seconds
2.PR / merge (CI) — on every pull request an automated eval runs on the full golden set; if quality drops below the threshold, the gate blocks the merge — exactly like a failing test
3.Staging — a regression eval compares the new version with the previous one; it catches silent regressions on known cases before they reach users
4.Production — an online eval on a sample of real traffic; the judge scores a random percentage of live responses, and an alert fires on a quality drop (this connects to the AI monitoring from a separate article)

The tool choice depends on the need — and in practice mature teams combine two of the three open-source ones:

DeepEval — when you want LLM unit tests integrated with the pipeline (pytest-style, 14+ metrics, custom plain-English rubrics); the default choice for CI/CD
Ragas — when you deeply evaluate RAG; research-backed retrieval and generation metrics, the most cited in academic papers; often added to DeepEval
Promptfoo — when you need red teaming and security validation (prompt injection!) alongside prompt evaluation; YAML config, no code
Braintrust — when you want one platform connecting the whole lifecycle: dataset, scoring, production monitoring and CI gates in one place (a commercial SaaS)

The rule: start with DeepEval (or Promptfoo if you prefer YAML), add Ragas when RAG needs deeper analysis, and reach for a platform like Braintrust when the team grows and you want everything in one tool.

LLM evaluation deployment checklist

1.Build a golden dataset from real production failures — 200–500 diverse examples, not synthetic ones
2.Version the golden set in the repository and review changes via pull requests
3.Match metrics to the application type — for RAG the triad: faithfulness, answer relevancy, context precision
4.Write judge rubrics specifically: break the score into clear, checkable criteria
5.Use a different (strong) model for judging than for generation — avoiding self-preference bias
6.Calibrate the judge: 50–100 manual ratings, compute agreement, aim for 85–90%
7.Counteract biases: pairwise scoring in both orders, reward conciseness in the rubric, neutral prompts
8.Set thresholds (quality gates) on metrics and wire the eval into CI — let it block the merge on a quality drop
9.Add a regression eval in staging — compare the new version with the previous one on the golden set
10.Deploy an online eval in production: sample real traffic, alert on quality drops
11.Add every new production failure to the golden set — so the same bug never returns unnoticed
12.Combine tools deliberately: DeepEval/Promptfoo for the pipeline, Ragas for RAG, Braintrust when you want one platform

Key takeaways

Without evaluation you develop an AI application blind — every prompt or model change is a roulette. Evals turn "seems better" into a number and separate a professional AI team from an amateur one. The foundation is a golden dataset from real failures (200–500 diverse examples), not synthetic ones. The most effective scoring method is LLM-as-a-judge — ~85% agreement with humans at a cost 500–5,000× lower — but only after calibration and with bias counteraction (position, verbosity, self-preference, sycophancy). Match metrics to the application (for RAG: faithfulness, answer relevancy, context precision), wire evals into CI as quality gates and add every failure to the golden set. Tools: DeepEval for the pipeline, Ragas for RAG, Promptfoo for security, Braintrust when you want one platform.

---

I help companies build evaluation systems for AI applications — from the golden dataset and metric selection, through LLM-as-a-judge calibration and bias counteraction, to wiring evals into CI/CD and production monitoring. Get in touch — I start with a free 30-minute analysis of your use case.

/// RELATED_SERVICES

Need these concepts implemented? Explore the services related to this topic.

Service

AI App Development

Custom AI software and AI-powered web applications. MVP development, full stack engineering, and AI systems programming from scratch to production.

View service

/// SOURCES

/// RELATED_RECORDS

AI & Automation

Vibe Coding: Complete Guide to AI Coding Tools 2026

Claude Code, Cursor, GitHub Copilot, Codex CLI, Gemini CLI, Lovable, Bolt.new — 60% of all new code worldwide is AI-generated (Gartner, 2026). A complete map of 11 vibe coding tools across 3 categories, with pricing, use cases, and a selection guide for businesses.

18 min

AI & Automation

AI Deep Research — How an Agent Searches the Web and Writes the Report Instead of Your Analyst

OpenAI Deep Research, Perplexity, and web-browsing agents are reshaping desk research: a report that takes an analyst 4–8 hours, an agent finishes in 5–20 minutes with source citations. I explain how these tools work, when they genuinely replace a human and when they don't, what ROI looks like, how to build your own research-automation pipeline, and when it makes sense to let the agent do it instead of an employee.

15 min

AI & Automation

AI in Recruitment and HR 2026 — CV Screening Automation, EU AI Act Obligations, and When AI Helps vs Hurts

AI cuts CV screening time by 75%, but recruitment systems are classified as high-risk AI under the EU AI Act — with a full compliance package: human oversight, transparency, technical documentation, EU database registration. I explain what AI in HR can safely do (screening as a filter, chatbot, onboarding), where the line is (autonomous decisions without a human), which tools work for SMEs, and how to avoid legal exposure.

17 min

/// AUTHOR

Paweł Wiszniewski

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.

LinkedIn Facebook

Signal received?

Terminate
Silence

Initiate protocol. Establish connection. Let's build something loud.

> WAITING_FOR_INPUT...

BIAŁYSTOK, PL

+48 732 022 086 pawel.wiszniewski95@gmail.com

Evals vs tests — what's the difference?

Golden dataset — the foundation of every evaluation

LLM-as-a-judge — how a model evaluates a model

From production logs to a quality alert

Which metrics to measure

LLM-as-a-judge pitfalls — judge biases

Evaluation in code — a DeepEval example

Evals in CI/CD — quality gates

LLM evaluation deployment checklist

Key takeaways

/// RELATED_SERVICES

AI App Development

/// SOURCES

/// RELATED_RECORDS

Vibe Coding: Complete Guide to AI Coding Tools 2026

AI Deep Research — How an Agent Searches the Web and Writes the Report Instead of Your Analyst

AI in Recruitment and HR 2026 — CV Screening Automation, EU AI Act Obligations, and When AI Helps vs Hurts

Signal received?

TerminateSilence

Terminate
Silence