Do I have to use DeepEval, or can I write my own assertions?

You don't have to. For simple cases your own functions are enough: **assert keyword in response**, cosine similarity with embeddings, regex for JSON format validation. DeepEval and RAGAS pay off with ≥ 20 quality criteria or when you need LLM-as-judge with result interpretation. Start with simple assertions — add a framework when you feel their limits.
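As a rough illustration of the "own functions" route, here is a minimal sketch using only the standard library. The function names are hypothetical, and the embedding vectors are assumed to come from whatever embedding model you already use.

```python
import math
import re


def contains_keyword(response: str, keyword: str) -> bool:
    """Case-insensitive substring check."""
    return keyword.lower() in response.lower()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Rough shape check only: the response is a single {...} block with no extra chatter.
# It does not prove the JSON is valid; json.loads is the stricter follow-up.
JSON_OBJECT_RE = re.compile(r"^\s*\{.*\}\s*$", re.DOTALL)


def looks_like_json_object(response: str) -> bool:
    return bool(JSON_OBJECT_RE.match(response))
```

In a test you would compare the embedding of the answer with the embedding of a reference answer and assert that the similarity stays above a threshold you have chosen for your domain.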

How do I test an agent with multiple tools?

Test each tool individually (unit test with a mocked API). For the whole agent: define end-to-end scenarios with the expected final result, not the exact path through tools. DeepEval has agent metrics: Tool Correctness, Task Completion, Step Efficiency. Mock external APIs (databases, CRM) — test the agent's logic, not external infrastructure.
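A minimal sketch of an end-to-end scenario with mocked tools, assuming a toy agent that receives its tools as arguments. `refund_agent`, the `crm` and `payments` objects and the 14-day rule are all hypothetical stand-ins, not part of any framework.

```python
from unittest.mock import MagicMock


def refund_agent(order_id: str, crm, payments) -> str:
    """Toy agent (hypothetical): looks up an order, then refunds it if eligible."""
    order = crm.get_order(order_id)
    if order["days_since_delivery"] > 14:
        return "refund_denied"
    payments.refund(order_id, order["amount"])
    return "refund_issued"


def run_scenario(days_since_delivery: int):
    """Run the agent against mocked tools and return (result, crm, payments)."""
    crm = MagicMock()
    crm.get_order.return_value = {
        "days_since_delivery": days_since_delivery,
        "amount": 49.0,
    }
    payments = MagicMock()
    result = refund_agent("A-1", crm, payments)
    return result, crm, payments
```

The scenario asserts the final result and the one side effect that matters (whether a refund was issued), rather than the exact sequence of tool calls.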

How often should I run the full evaluation?

Schedule: unit tests on every commit (< 5s), integration tests on every PR to a feature branch (~2 min, ~$0.01), full LLM-as-judge evaluation on merge to main (5–15 min, ~$0.05), baseline regression once a day automatically. Don't run full evaluation on every commit: you'll blow the budget and push CI time past ten minutes.

What to do when tests are flaky — sometimes passing, sometimes failing?

Three causes: temperature > 0 (lower it, or pass a fixed seed such as seed=42 if the model supports it; this reduces variance but does not guarantee identical output), threshold too close to the boundary (if threshold=0.75 and the model oscillates between 0.73–0.77, that's normal variance, so adjust the threshold or refine the criterion), criteria too vague for G-Eval. Fix: run the test 3× (pytest-repeat can do the reruns) and accept if ≥ 2/3 pass. A single result is too small a sample for probabilistic systems.
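As far as I know, pytest-repeat reports each repetition as its own test, so the "2 out of 3" rule needs a small aggregation step. A minimal sketch, assuming the flaky check can be wrapped in a function that returns True or False (`passes_majority` is a hypothetical helper name):

```python
from typing import Callable


def passes_majority(check: Callable[[], bool], runs: int = 3, required: int = 2) -> bool:
    """Run a flaky boolean check up to `runs` times; True once `required` runs pass."""
    passed = 0
    for attempt in range(runs):
        if check():
            passed += 1
        if passed >= required:
            return True  # enough passes, stop early to save API calls
        remaining = runs - attempt - 1
        if passed + remaining < required:
            return False  # the required count can no longer be reached
    return False
```

Inside a test this becomes `assert passes_majority(lambda: score_answer() >= 0.75)`, where `score_answer` stands for your own evaluation call.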

How many cases should a golden dataset have — 20, 50 or 100?

It depends on coverage. Minimum: **30 cases** for 80% of the most common queries. Optimal: **50–100 cases** covering edge cases and the long tail. Beyond 100 — diminishing returns, unless the app handles 10+ query categories. Practical rule: every new production bug adds one case — after 3 months in production the dataset will grow to a sensible size on its own. Start with 30 and expand.

How to test an agent that uses databases and external APIs?

Three layers: (1) **unit tests with mocking** of the database and external APIs — you test the agent's logic in isolation, without side effects; (2) **integration tests** on a test database instance with seed data — the agent always starts from the same state; (3) **end-to-end** on a staging environment with controlled, realistic data. Key rule: never test on the production database. Seed data fixtures are absolutely essential — without them tests are non-reproducible.
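The seed-data idea from layer (2) can be sketched with an in-memory SQLite database. The table layout, the seed rows and `returnable_order_ids` are hypothetical; a real project would point this at its own test instance.

```python
import sqlite3

# Fixed seed rows: every test run starts from exactly this state.
SEED_ORDERS = [
    ("A-1", "delivered", 3),
    ("A-2", "delivered", 30),
]


def make_seeded_db() -> sqlite3.Connection:
    """Create a fresh in-memory database that always starts from the same state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, days_since_delivery INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", SEED_ORDERS)
    conn.commit()
    return conn


def returnable_order_ids(conn: sqlite3.Connection, window_days: int = 14) -> list[str]:
    """Stand-in for an agent tool (hypothetical) that queries the database."""
    rows = conn.execute(
        "SELECT id FROM orders WHERE status = 'delivered' "
        "AND days_since_delivery <= ? ORDER BY id",
        (window_days,),
    )
    return [row[0] for row in rows]
```

Wrapping `make_seeded_db` in a pytest fixture gives each test its own copy, so one test's writes cannot leak into the next.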

How do you calibrate an LLM-judge — how do you know it's reliable?

Calibration in 3 steps: (1) collect 100–200 application outputs and label them manually (2 annotators — calculate inter-rater agreement between them); (2) run the LLM-judge on the same examples and calculate agreement with the manual labels (Pearson r or weighted kappa); (3) if r > 0.80 — the judge is reliable for this domain. If r < 0.70 — improve the judge's prompt (add examples with explanations, sharpen the criteria) or use a larger model. Repeat calibration after any major change to the application's domain.
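Step (2) can be sketched in plain Python for the Pearson case. The thresholds come from the text above; the "borderline" label for values between 0.70 and 0.80 is my own assumption, since the text does not name that range.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def judge_verdict(r: float) -> str:
    """Map the correlation onto the thresholds from the text (0.80 and 0.70)."""
    if r > 0.80:
        return "reliable"
    if r < 0.70:
        return "improve judge prompt or use a larger model"
    return "borderline"  # assumed label for the range the text leaves open
```

Feed it the human labels and the judge scores for the same outputs, in the same order: `judge_verdict(pearson_r(human_scores, judge_scores))`.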

Should you write tests before or after deploying an AI model?

Definitely **before** — this is "test-driven prompt development". Before writing a prompt, define 10–15 golden cases with expected criteria. Then iterate the prompt until the tests pass. Benefits: you force precise thinking about requirements, you have an objective "done" gate, you avoid confirmation bias (you're not evaluating a prompt based on what happened to come out). In practice, 1 hour on a golden dataset saves 3–5 hours of debugging after deployment.


2026-06-07 · AI & Automation · 14 min

How to Write Tests for AI Applications: Test Pyramid, Golden Dataset and CI/CD for LLMs


Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

Testing AI applications requires different tools from classical code — an LLM is non-deterministic, so instead of checking string equality you test semantics, behaviour and edge-case coverage. A properly designed test suite gives you confidence before deployment: semantic tests check whether the response is meaningfully correct, contract-schema tests verify output format, and regression tests detect whether a model or prompt update has broken existing behaviour.

How to design a test suite for LLM applications: unit tests with mocking, integration tests with a golden dataset, LLM-as-judge evaluation and a quality gate in CI/CD. With real Python code and GitHub Actions.

You change a prompt from "reply in Polish" to "always respond in the Polish language". Tests pass — they were written for the old version. Three weeks later a client calls: "your app started responding in English for some users." It turns out a model update changed default behaviour and the new prompt stopped working for edge cases. You had no regression tests, so you had no idea.

Testing AI applications is a different problem from testing classic code. LLM output is non-deterministic — identical input can produce different output on every call. But you can design a test suite that gives you confidence before deployment.

Why won't classic assertEquals work for LLMs?

Classic tests check equality: assert result == expected. For LLMs this makes no sense — the model says the same thing in different words every time.

Three fundamental differences:

- Non-determinism: temperature > 0 produces different output on every call, even with identical input
- Semantics instead of syntax: "You have 14 days" and "The return window is two weeks" are the same answer, but an exact-match assertion can accept at most one of them
- No oracle: there is no single "correct" answer the way there is for a sorting algorithm or a parser

Result: tests that look like tests but don't catch real regressions.

What to test — and what not to?

Test:

the logic of prompt construction — deterministic, testable like any other code
response parsing correctness — does the JSON have the required fields and correct types?
output format — does the result fit within character limits, does it have the required structure?
semantic quality criteria — faithfulness, relevance — on a sample of 5–10% of calls
semantic regression against a baseline after every model update

Don't test:

the base model itself — OpenAI, Anthropic and Google test that for you
exact content word-for-word — the model will rephrase and the test will fail
model performance on general benchmarks — that's not your code
internal framework mechanics (LangChain, LlamaIndex) — you'd be testing their code, not yours
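The parsing and format items from the "Test" list are deterministic, so they can be checked without any model call. A minimal sketch, assuming a hypothetical response contract with `answer`, `sources` and `confidence` fields and an assumed 500-character limit:

```python
import json

# Hypothetical contract: field name -> expected Python type after json.loads.
ANSWER_SCHEMA = {"answer": str, "sources": list, "confidence": float}
MAX_ANSWER_CHARS = 500  # assumed limit, adjust to your product


def validate_response(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected_type in ANSWER_SCHEMA.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}")
    if isinstance(data.get("answer"), str) and len(data["answer"]) > MAX_ANSWER_CHARS:
        errors.append("answer exceeds character limit")
    return errors
```

Returning a list of violations instead of raising on the first one makes the failure message in CI more useful: one run shows everything that is wrong with the output.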

/// TEST PYRAMID FOR AI APPLICATIONS

The higher you go, the slower and pricier, but the deeper the semantic evaluation. Don't test just one layer: each catches a different class of bugs.

Layer	Tooling	Cost and cadence
LLM-as-judge evaluation (top)	G-Eval · semantic quality scoring · 0.85+ correlation with humans	Slow · expensive · 5–10% of calls
Integration: golden dataset (middle)	Real API · DeepEval / RAGAS · faithfulness + relevance	~$0.01/PR · minutes · per AI PR
Unit: mock LLM (base)	Deterministic · pytest · test the logic, not the model	$0 · <5s · per commit

Unit test cost: $0 · Integration cost per PR: ~$0.01 · Full eval cost: ~$0.05

Layer 1: Unit tests — mock the model, test the code

Unit tests for AI do not call the model. They test application logic around the model: prompt construction, response parsing, error handling. Use unittest.mock — the framework returns a predefined response. Tests are fast (< 5 seconds), free and fully deterministic.

test_unit.py

```python
# test_unit.py
from unittest.mock import MagicMock, patch

import pytest

from myapp.rag_chain import RAGChain, build_prompt


def test_prompt_contains_context():
    context = "Returns policy: 14 days no questions asked."
    question = "How long do I have to return?"
    messages = build_prompt(question, context)
    # Both the retrieved context and the question must reach the user message.
    user_content = " ".join(m["content"] for m in messages if m["role"] == "user")
    assert "14 days" in user_content
    assert question in user_content


def test_prompt_has_system_role():
    messages = build_prompt("test?", "ctx")
    assert messages[0]["role"] == "system"
    assert "assistant" in messages[0]["content"].lower()


@patch("myapp.rag_chain.client.chat.completions.create")
def test_chain_passes_context_to_model(mock_create):
    # The mocked client returns a canned completion, so no API call is made.
    mock_create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="You have 14 days to return."))]
    )
    chain = RAGChain()
    result = chain.run("How long do I have to return?", context="14 days")
    assert "14 days" in result
    # call_args[1] holds the keyword arguments of the mocked call.
    msgs = mock_create.call_args[1]["messages"]
    assert any("14 days" in str(m) for m in msgs)


@patch("myapp.rag_chain.client.chat.completions.create")
def test_chain_raises_on_empty_context(mock_create):
    # The client is still patched so a bug here can never trigger a real API call.
    chain = RAGChain()
    with pytest.raises(ValueError, match="empty context"):
        chain.run("question", context="")
```

Run: pytest tests/unit/ -v — zero API costs, results in < 5 seconds.

Layer 2: Integration tests — golden dataset as the oracle

Integration tests call the real model with a set of golden test cases. A golden dataset is a collection of input → quality criterion examples — created once, version-controlled in the repository, extended after every production bug.

How to build a golden dataset?

1. Collect 30–100 representative questions from production logs or domain experts; more examples means better coverage
2. Add edge cases: out-of-scope questions, ambiguities, questions in a different language, extremely long contexts
3. Define a criterion, not an exact answer: "must contain the number 14", "must not contain the word 'sorry'", faithfulness threshold ≥ 0.80
4. Version it like code: store in tests/fixtures/golden.json; every change requires a code review and a comment explaining why
5. Extend after every bug: a new production failure becomes a new test case, so the dataset gets more accurate over time
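The steps above can be sketched as a small loader plus a criterion checker. The field names (`must_contain`, `must_not_contain`) and the sample cases are assumptions for illustration; the article does not prescribe a schema for golden.json.

```python
import json

# Hypothetical golden.json content, inlined here so the sketch is self-contained.
GOLDEN_JSON = """
[
  {"id": "returns-window", "question": "How many days to return?",
   "must_contain": ["14"], "must_not_contain": ["sorry"]},
  {"id": "out-of-scope", "question": "What's the weather tomorrow?",
   "must_contain": ["can't help"], "must_not_contain": []}
]
"""

REQUIRED_KEYS = {"id", "question", "must_contain", "must_not_contain"}


def load_golden(text: str) -> list[dict]:
    """Parse the dataset and fail fast if a case is missing a required key."""
    cases = json.loads(text)
    for case in cases:
        missing = REQUIRED_KEYS.difference(case)
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
    return cases


def check_case(case: dict, answer: str) -> list[str]:
    """Return the criteria the answer violates (empty list means pass)."""
    text = answer.lower()
    failures = [f"missing: {s}" for s in case["must_contain"] if s.lower() not in text]
    failures += [f"forbidden: {s}" for s in case["must_not_contain"] if s.lower() in text]
    return failures
```

Validating the file on load means a malformed case fails immediately in code review, not later as a confusing test error.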

test_integration.py

```python
# test_integration.py
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

from myapp.rag_chain import RAGChain


@pytest.fixture(scope="session")
def chain():
    return RAGChain()


@pytest.mark.parametrize("case", [
    {"q": "How many days to return?", "ctx": "Returns: 14 days.", "exp": "14"},
    {"q": "Can I return a used product?", "ctx": "New products only.", "exp": "new"},
    {"q": "Return shipping cost?", "ctx": "Free returns on orders over $50.", "exp": "50"},
])
def test_golden_set(case, chain):
    answer = chain.run(case["q"], context=case["ctx"])
    test_case = LLMTestCase(
        input=case["q"],
        actual_output=answer,
        retrieval_context=[case["ctx"]],
    )
    # Semantic metrics, scored by an LLM judge against the thresholds.
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ])
    # Cheap deterministic check that the key fact is present.
    assert case["exp"].lower() in answer.lower()
```

Cost: 100 cases × 500 tokens × $0.15 per 1M tokens ≈ $0.0075 (~$0.008) per run for generating the answers. The judge calls made by the metrics add to this.
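The same arithmetic as a tiny helper, in case you want to budget other dataset sizes. The token count and price are the example figures from the text, not current provider pricing.

```python
def run_cost_usd(cases: int, tokens_per_case: int, usd_per_million_tokens: float) -> float:
    """Estimated cost of one run: total tokens times the per-token price."""
    total_tokens = cases * tokens_per_case
    return total_tokens * usd_per_million_tokens / 1_000_000


# 100 cases x 500 tokens = 50,000 tokens at $0.15 per 1M tokens.
golden_run = run_cost_usd(cases=100, tokens_per_case=500, usd_per_million_tokens=0.15)
print(f"${golden_run:.4f} per run")
```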

Layer 3: Evaluation — when the criterion is "good", not "correct"

For complex outputs (summaries, generated content, multi-turn conversations) we use LLM-as-judge: a second model evaluates output quality according to criteria. DeepEval implements G-Eval — a method achieving 0.85+ correlation with human evaluation.

Important rule: the judge should come from a different model family than the model being evaluated — if production uses GPT-4o, the judge should be Claude or vice versa. This eliminates self-enhancement bias (the same model rates its own output 10–25% higher than it should).

test_eval.py

```python
# test_eval.py
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from myapp.summarizer import summarize_document

# Each G-Eval metric is a natural-language criterion plus a pass threshold.
ACCURACY = GEval(
    name="Accuracy",
    criteria="Is the summary factually consistent with the document and free of information not present in it?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

CONCISENESS = GEval(
    name="Conciseness",
    criteria="Is the summary concise and free of repeated information?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.6,
)


@pytest.mark.parametrize("doc,min_len,max_len", [
    ("Q1 report: revenue up 15%. Costs stable. Margin 32%.", 20, 200),
    ("Returns policy: 14 days, new products only, shipping at customer expense.", 15, 150),
])
def test_summarizer_quality(doc, min_len, max_len):
    summary = summarize_document(doc)
    # Deterministic length check first, then the LLM-judged criteria.
    assert min_len <= len(summary) <= max_len
    assert_test(
        LLMTestCase(input=doc, actual_output=summary),
        [ACCURACY, CONCISENESS],
    )
```

/// CI/CD WITH AN AI QUALITY GATE

Unit tests on every commit, eval only when prompts change

paths: [myapp/prompts/**, myapp/chain.py, tests/fixtures/**]

Pipeline: Commit / PR (every push) → Unit Tests (<5s · $0) → Smoke Eval (5 cases · AI only) → Full Eval (golden set + LLM-judge) → Gate ≥ 0.80 (merge blocked below the threshold) → Deploy (production)

Unit tests per PR: <5s · Full eval on merge: ~5 min · Eval cost per merge: ~$0.05

How to configure CI/CD — a quality gate before deployment?

Goal: every PR that changes prompts or AI logic must pass a quality evaluation. If results fall below the threshold — merge is automatically blocked.

Two stages in the pipeline:

- unit tests: run on every PR (< 30 seconds, $0 API cost)
- eval gate: runs only when changed files match AI paths (a few minutes, ~$0.01)

.github/workflows/ai-eval.yml

```yaml
name: AI Evaluation Gate

on:
  pull_request:
    # This filter applies to the whole workflow, so both jobs below run only
    # for AI-related changes. Run general unit tests from a workflow without it.
    paths:
      - 'myapp/prompts/**'
      - 'myapp/chain.py'
      - 'tests/fixtures/**'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.12'}
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v --tb=short

  eval-gate:
    needs: unit-tests
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.12'}
      - run: pip install -r requirements.txt deepeval
      # Confirm the available flags with `deepeval test run --help` for your installed version.
      - run: deepeval test run tests/integration/ --min-success-rate 0.85
      - run: python tests/regression/check_baseline.py --threshold 0.80
```

Post the evaluation results as a PR comment (this needs an extra reporting step that the workflow above does not include), so the developer sees which metrics failed without digging through CI logs.

How to detect regression after a model update?

Providers such as OpenAI and Anthropic update their models, and an unpinned model alias can change behaviour without any change on your side. To know when something has changed, you need a baseline: saved evaluation results on the golden dataset tied to a specific model version.

1. After every deployment, save results as baseline_vN.json with the date and model version
2. After any suspected model update, run the eval again and compare against the baseline
3. Alert when faithfulness or relevance drops by more than 5 percentage points
4. If regression is confirmed, roll back to the previous prompt version or block the model update

Metric	Baseline (May)	After update	Change	Action
Faithfulness	0.87	0.79	−8 pp	Alert + prompt analysis
Answer Relevancy	0.82	0.81	−1 pp	OK, within variance
Hallucination rate	3.2%	4.1%	+0.9 pp	Monitor for a week
Latency p95	2.1s	2.3s	+10%	OK
Cost/1k calls	$0.024	$0.026	+8%	OK
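The comparison against a baseline can be sketched as below. This is one guess at the kind of logic a script like check_baseline.py might contain (the article does not show it); the metric names and the dict layout are assumptions, and the numbers are the faithfulness and relevancy rows from the table.

```python
# Metrics where a lower value is worse, so a drop counts as a regression.
HIGHER_IS_BETTER = {"faithfulness", "answer_relevancy"}
MAX_DROP_PP = 5.0  # alert threshold in percentage points, from the text


def find_regressions(baseline: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Return {metric: drop in percentage points} for metrics that fell by more than MAX_DROP_PP."""
    regressions = {}
    for metric in HIGHER_IS_BETTER:
        if metric not in baseline or metric not in current:
            continue  # skip metrics that were not measured in both runs
        drop_pp = (baseline[metric] - current[metric]) * 100
        if drop_pp > MAX_DROP_PP:
            regressions[metric] = round(drop_pp, 1)
    return regressions


baseline = {"faithfulness": 0.87, "answer_relevancy": 0.82}
current = {"faithfulness": 0.79, "answer_relevancy": 0.81}
print(find_regressions(baseline, current))
```

In CI the script would exit with a non-zero status when the returned dict is non-empty, which is what blocks the merge.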

Tools — what to choose and when?

Tool	Type	Best for	Price
DeepEval	Python SDK	Full pyramid: unit + integration + G-Eval + CI	Free / $29+
RAGAS	Python SDK	RAG pipelines: faithfulness, context recall, precision	Free
PromptFoo	CLI/YAML	Prompt version comparison, red teaming, A/B	Free / $20+
Evidently	Python + Action	Quality regression, dashboards, GitHub Actions	Free / $95+
LangSmith	SaaS	LangChain + eval + tracing in one place	Free / $39+

Common mistakes — what not to do

Testing the base model instead of your own code — you're checking GPT-4's intelligence, not your code. Test only the logic you wrote yourself
One giant test instead of a pyramid — if every test calls LLM-as-judge, you pay $0.05 per commit and CI takes 20 minutes
Golden dataset without edge cases — 50 happy path examples will catch 30% of bugs. Add 20% hard questions and 20% out-of-scope cases
No dataset versioning — a changed golden.json without code review is a recipe for false positives
Exact content comparison: assert result == "14 days" fails when the model says "two weeks". assert "14" in result tolerates rewording around the number but still misses "two weeks"; use DeepEval metrics when paraphrases must pass
Ignoring evaluation cost — 1000 cases × LLM-as-judge = $1–5 per run. Budget it and run full evaluation on merge, not on every commit

Checklist

1. Unit tests mock the LLM: zero API cost, deterministic, run on every commit
2. Golden dataset: 30–100 cases including edge cases and production bugs, versioned like code
3. Integration tests use DeepEval or RAGAS with thresholds: faithfulness threshold ≥ 0.80
4. G-Eval evaluation for complex outputs, with a judge from a different model family than production
5. Baseline after every deployment, alert when metrics drift by more than 5 percentage points
6. CI/CD: unit on every PR, eval gate only on AI changes via the paths filter in GitHub Actions
7. Evaluation cost monitored and budgeted: < $0.05 per merge to main
8. Results visible in PR comments, no digging through CI logs required

---

I build testing systems for AI applications — from the unit test pyramid through golden datasets to CI/CD with a quality gate. If your LLM app is in production without a test suite, get in touch — I start with a code audit and designing the minimal sensible pyramid.


/// AUTHOR

Paweł Wiszniewski

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.
