RETURN_TO_BLOG
AI & Automation 14 min

How to Write Tests for AI Applications: Test Pyramid, Golden Dataset and CI/CD for LLMs

How to design a test suite for LLM applications: unit tests with mocking, integration tests with a golden dataset, LLM-as-judge evaluation and a quality gate in CI/CD. With real Python code and GitHub Actions.

You change a prompt from "reply in Polish" to "always respond in the Polish language". Tests pass — they were written for the old version. Three weeks later a client calls: "your app started responding in English for some users." It turns out a model update changed default behaviour and the new prompt stopped working for edge cases. You had no regression tests, so you had no idea.

Testing AI applications is a different problem from testing classic code. LLM output is non-deterministic — identical input can produce different output on every call. But you can design a test suite that gives you confidence before deployment.

Why won't classic assertEquals work for LLMs?

Classic tests check equality: assert result == expected. For LLMs this makes no sense — the model says the same thing in different words every time.

Three fundamental differences: - Non-determinism — temperature > 0 produces different output on every call, even with identical input - Semantics instead of syntax — "You have 14 days" and "The return window is two weeks" are the same answer — assertEqual rejects both versions as wrong - No oracle — there is no single "correct" answer the way there is for a sorting algorithm or a parser

Result: tests that look like tests but don't catch real regressions.

What to test — and what not to?

Test: - the logic of prompt construction — deterministic, testable like any other code - response parsing correctness — does the JSON have the required fields and correct types? - output format — does the result fit within character limits, does it have the required structure? - semantic quality criteria — faithfulness, relevance — on a sample of 5–10% of calls - semantic regression against a baseline after every model update

Don't test: - the base model itself — OpenAI, Anthropic and Google test that for you - exact content word-for-word — the model will rephrase and the test will fail - model performance on general benchmarks — that's not your code - internal framework mechanics (LangChain, LlamaIndex) — you'd be testing their code, not yours

/// PIRAMIDA TESTÓW DLA APLIKACJI AI

Im wyżej, tym wolniej i drożej — ale głębsza ocena semantyczna

Nie testuj tylko jednej warstwy — każda łapie inne klasy błędów

Ewaluacja LLM-as-judge
G-Eval · semantyczna ocena jakości · korelacja 0.85+ z człowiekiem
WOLNE · DROGIE · 5–10% WYWOŁAŃ
Integracyjne — Złoty Dataset
Prawdziwe API · DeepEval / RAGAS · faithfulness + relevance
~$0.01/PR · MINUTY · NA KAŻDY PR Z AI
Jednostkowe — Mock LLM
Deterministyczne · pytest · testuj logikę, nie model
$0 · <5s · NA KAŻDY COMMIT
$0
KOSZT JEDNOSTKOWYCH
~$0.01
KOSZT INTEGRACYJNYCH/PR
~$0.05
KOSZT PEŁNEJ EWALUACJI

Layer 1: Unit tests — mock the model, test the code

Unit tests for AI do not call the model. They test application logic around the model: prompt construction, response parsing, error handling. Use unittest.mock — the framework returns a predefined response. Tests are fast (< 5 seconds), free and fully deterministic.

test_unit.py
# test_unit.pyfrom unittest.mock import patch, MagicMockimport pytestfrom myapp.rag_chain import build_prompt, RAGChaindef test_prompt_contains_context():    context = "Returns policy: 14 days no questions asked."    question = "How long do I have to return?"    messages = build_prompt(question, context)    user_content = " ".join(m["content"] for m in messages if m["role"] == "user")    assert "14 days" in user_content    assert question in user_contentdef test_prompt_has_system_role():    messages = build_prompt("test?", "ctx")    assert messages[0]["role"] == "system"    assert "assistant" in messages[0]["content"].lower()@patch("myapp.rag_chain.client.chat.completions.create")def test_chain_passes_context_to_model(mock_create):    mock_create.return_value = MagicMock(        choices=[MagicMock(message=MagicMock(content="You have 14 days to return."))]    )    chain = RAGChain()    result = chain.run("How long do I have to return?", context="14 days")    assert "14 days" in result    msgs = mock_create.call_args[1]["messages"]    assert any("14 days" in str(m) for m in msgs)@patch("myapp.rag_chain.client.chat.completions.create")def test_chain_raises_on_empty_context(mock_create):    chain = RAGChain()    with pytest.raises(ValueError, match="empty context"):        chain.run("question", context="")

Run: pytest tests/unit/ -v — zero API costs, results in < 5 seconds.

Layer 2: Integration tests — golden dataset as the oracle

Integration tests call the real model with a set of golden test cases. A golden dataset is a collection of input → quality criterion examples — created once, version-controlled in the repository, extended after every production bug.

How to build a golden dataset?

  1. 1.Collect 30–100 representative questions — from production logs or domain experts; more examples means better coverage
  2. 2.Add edge cases — out-of-scope questions, ambiguities, questions in a different language, extremely long contexts
  3. 3.Define a criterion, not an exact answer — "must contain the number 14", "must not contain the word 'sorry'", faithfulness threshold ≥ 0.80
  4. 4.Version it like code — store in tests/fixtures/golden.json, every change requires a code review and a comment explaining why
  5. 5.Extend after every bug — a new production failure becomes a new test case; the dataset gets more accurate over time
test_integration.py
# test_integration.pyimport json, pytestfrom deepeval import assert_testfrom deepeval.test_case import LLMTestCasefrom deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetricfrom myapp.rag_chain import RAGChain@pytest.fixture(scope="session")def chain():    return RAGChain()@pytest.mark.parametrize("case", [    {"q": "How many days to return?", "ctx": "Returns: 14 days.", "exp": "14"},    {"q": "Can I return a used product?", "ctx": "New products only.", "exp": "new"},    {"q": "Return shipping cost?", "ctx": "Free returns on orders over $50.", "exp": "50"},])def test_golden_set(case, chain):    answer = chain.run(case["q"], context=case["ctx"])    test_case = LLMTestCase(        input=case["q"],        actual_output=answer,        retrieval_context=[case["ctx"]]    )    assert_test(test_case, [        AnswerRelevancyMetric(threshold=0.7),        FaithfulnessMetric(threshold=0.8)    ])    assert case["exp"].lower() in answer.lower()

Cost: 100 cases × 500 tokens × $0.15/1M = ~$0.008 per run.

Layer 3: Evaluation — when the criterion is "good", not "correct"

For complex outputs (summaries, generated content, multi-turn conversations) we use LLM-as-judge: a second model evaluates output quality according to criteria. DeepEval implements G-Eval — a method achieving 0.85+ correlation with human evaluation.

Important rule: the judge should come from a different model family than the model being evaluated — if production uses GPT-4o, the judge should be Claude or vice versa. This eliminates self-enhancement bias (the same model rates its own output 10–25% higher than it should).

test_eval.py
# test_eval.pyimport pytestfrom deepeval import assert_testfrom deepeval.test_case import LLMTestCase, LLMTestCaseParamsfrom deepeval.metrics import GEvalfrom myapp.summarizer import summarize_documentACCURACY = GEval(    name="Accuracy",    criteria="Is the summary factually consistent with the document and free of information not present in it?",    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],    threshold=0.7)CONCISENESS = GEval(    name="Conciseness",    criteria="Is the summary concise and free of repeated information?",    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],    threshold=0.6)@pytest.mark.parametrize("doc,min_len,max_len", [    ("Q1 report: revenue up 15%. Costs stable. Margin 32%.", 20, 200),    ("Returns policy: 14 days, new products only, shipping at customer expense.", 15, 150),])def test_summarizer_quality(doc, min_len, max_len):    summary = summarize_document(doc)    assert min_len <= len(summary) <= max_len    assert_test(        LLMTestCase(input=doc, actual_output=summary),        [ACCURACY, CONCISENESS]    )

/// CI/CD Z BRAMKĄ JAKOŚCI AI

Unit na każdym commicie — ewaluacja tylko przy zmianie promptów

paths: [myapp/prompts/**, myapp/chain.py, tests/fixtures/**]

01
Commit / PR
każdy push
02
Unit Tests
<5s · $0
03
Smoke Eval
5 cases · tylko AI
04
Full Eval
Golden set + LLM-judge
05
Bramka ≥ 0.80
Blokada merge
06
Deploy
Produkcja
<5s
UNIT TESTS NA KAŻDYM PR
~5 min
PEŁNA EWALUACJA NA MERGE
~$0.05
KOSZT EWALUACJI/MERGE

How to configure CI/CD — a quality gate before deployment?

Goal: every PR that changes prompts or AI logic must pass a quality evaluation. If results fall below the threshold — merge is automatically blocked.

Two stages in the pipeline: - unit tests — run on every PR (< 30 seconds, $0 API cost) - eval gate — runs only when changed files match AI paths (a few minutes, ~$0.01)

.github/workflows/ai-eval.yml
name: AI Evaluation Gateon:  pull_request:    paths:      - 'myapp/prompts/**'      - 'myapp/chain.py'      - 'tests/fixtures/**'jobs:  unit-tests:    runs-on: ubuntu-latest    steps:      - uses: actions/checkout@v4      - uses: actions/setup-python@v5        with: {python-version: '3.12'}      - run: pip install -r requirements.txt      - run: pytest tests/unit/ -v --tb=short  eval-gate:    needs: unit-tests    runs-on: ubuntu-latest    env:      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}    steps:      - uses: actions/checkout@v4      - uses: actions/setup-python@v5        with: {python-version: '3.12'}      - run: pip install -r requirements.txt deepeval      - run: deepeval test run tests/integration/ --min-success-rate 0.85      - run: python tests/regression/check_baseline.py --threshold 0.80

Evaluation results appear directly in PR comments — the developer sees which metrics failed without digging through CI logs.

How to detect regression after a model update?

OpenAI and Anthropic update models without warning. To know when something has changed, you need a baseline — saved evaluation results on the golden dataset tied to a specific model version.

  1. 1.After every deployment save results as baseline_vN.json with the date and model version
  2. 2.After any suspected model update run the eval again and compare against the baseline
  3. 3.Alert when faithfulness or relevance drops by more than 5 percentage points
  4. 4.If regression is confirmed — roll back to the previous prompt version or block the model update
MetricBaseline (May)After updateChangeAction
Faithfulness0.870.79−8%Alert + prompt analysis
Answer Relevancy0.820.81−1%OK — within variance
Hallucination rate3.2%4.1%+0.9%Monitor for a week
Latency p952.1s2.3s+10%OK
Cost/1k calls$0.024$0.026+8%OK

Tools — what to choose and when?

ToolTypeBest forPrice
DeepEvalPython SDKFull pyramid: unit + integration + G-Eval + CIFree / $29+
RAGASPython SDKRAG pipelines: faithfulness, context recall, precisionFree
PromptFooCLI/YAMLPrompt version comparison, red teaming, A/BFree / $20+
EvidentlyPython + ActionQuality regression, dashboards, GitHub ActionsFree / $95+
LangSmithSaaSLangChain + eval + tracing in one placeFree / $39+

Common mistakes — what not to do

  • Testing the base model instead of your own code — you're checking GPT-4's intelligence, not your code. Test only the logic you wrote yourself
  • One giant test instead of a pyramid — if every test calls LLM-as-judge, you pay $0.05 per commit and CI takes 20 minutes
  • Golden dataset without edge cases — 50 happy path examples will catch 30% of bugs. Add 20% hard questions and 20% out-of-scope cases
  • No dataset versioning — a changed golden.json without code review is a recipe for false positives
  • Exact content comparisonassert result == "14 days" fails when the model says "two weeks". Use assert "14" in result or DeepEval metrics
  • Ignoring evaluation cost — 1000 cases × LLM-as-judge = $1–5 per run. Budget it and run full evaluation on merge, not on every commit

Checklist

  1. 1.Unit tests mock the LLM — zero API cost, deterministic, run on every commit
  2. 2.Golden dataset: 30–100 cases including edge cases and production bugs, versioned like code
  3. 3.Integration tests use DeepEval or RAGAS with thresholds — faithfulness threshold ≥ 0.75
  4. 4.G-Eval evaluation for complex outputs — judge from a different model family than production
  5. 5.Baseline after every deployment, alert when metrics drift by more than 5 percentage points
  6. 6.CI/CD: unit on every PR, eval gate only on AI changes — paths filter in GitHub Actions
  7. 7.Evaluation cost monitored and budgeted: < $0.05 per merge to main
  8. 8.Results visible in PR comments — no digging through CI logs required

---

I build testing systems for AI applications — from the unit test pyramid through golden datasets to CI/CD with a quality gate. If your LLM app is in production without a test suite, get in touch — I start with a code audit and designing the minimal sensible pyramid.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

Paweł Wiszniewski

Senior Full-Stack Engineer & AI Architect

8+ years building AI systems, automations, and scalable web applications that reduce costs and improve operational efficiency.

Signal received?

Terminate
Silence

Initiate protocol. Establish connection. Let's build something loud.

> WAITING_FOR_INPUT...