How to Write Tests for AI Applications: Test Pyramid, Golden Dataset and CI/CD for LLMs
How to design a test suite for LLM applications: unit tests with mocking, integration tests with a golden dataset, LLM-as-judge evaluation and a quality gate in CI/CD. With real Python code and GitHub Actions.
You change a prompt from "reply in Polish" to "always respond in the Polish language". Tests pass — they were written for the old version. Three weeks later a client calls: "your app started responding in English for some users." It turns out a model update changed default behaviour and the new prompt stopped working for edge cases. You had no regression tests, so you had no idea.
Testing AI applications is a different problem from testing classic code. LLM output is non-deterministic — identical input can produce different output on every call. But you can design a test suite that gives you confidence before deployment.
Why won't classic assertEquals work for LLMs?
Classic tests check equality: assert result == expected. For LLMs this makes no sense — the model says the same thing in different words every time.
Three fundamental differences: - Non-determinism — temperature > 0 produces different output on every call, even with identical input - Semantics instead of syntax — "You have 14 days" and "The return window is two weeks" are the same answer — assertEqual rejects both versions as wrong - No oracle — there is no single "correct" answer the way there is for a sorting algorithm or a parser
Result: tests that look like tests but don't catch real regressions.
What to test — and what not to?
Test: - the logic of prompt construction — deterministic, testable like any other code - response parsing correctness — does the JSON have the required fields and correct types? - output format — does the result fit within character limits, does it have the required structure? - semantic quality criteria — faithfulness, relevance — on a sample of 5–10% of calls - semantic regression against a baseline after every model update
Don't test: - the base model itself — OpenAI, Anthropic and Google test that for you - exact content word-for-word — the model will rephrase and the test will fail - model performance on general benchmarks — that's not your code - internal framework mechanics (LangChain, LlamaIndex) — you'd be testing their code, not yours
/// PIRAMIDA TESTÓW DLA APLIKACJI AI
Im wyżej, tym wolniej i drożej — ale głębsza ocena semantyczna
Nie testuj tylko jednej warstwy — każda łapie inne klasy błędów
Layer 1: Unit tests — mock the model, test the code
Unit tests for AI do not call the model. They test application logic around the model: prompt construction, response parsing, error handling. Use unittest.mock — the framework returns a predefined response. Tests are fast (< 5 seconds), free and fully deterministic.
# test_unit.pyfrom unittest.mock import patch, MagicMockimport pytestfrom myapp.rag_chain import build_prompt, RAGChaindef test_prompt_contains_context(): context = "Returns policy: 14 days no questions asked." question = "How long do I have to return?" messages = build_prompt(question, context) user_content = " ".join(m["content"] for m in messages if m["role"] == "user") assert "14 days" in user_content assert question in user_contentdef test_prompt_has_system_role(): messages = build_prompt("test?", "ctx") assert messages[0]["role"] == "system" assert "assistant" in messages[0]["content"].lower()@patch("myapp.rag_chain.client.chat.completions.create")def test_chain_passes_context_to_model(mock_create): mock_create.return_value = MagicMock( choices=[MagicMock(message=MagicMock(content="You have 14 days to return."))] ) chain = RAGChain() result = chain.run("How long do I have to return?", context="14 days") assert "14 days" in result msgs = mock_create.call_args[1]["messages"] assert any("14 days" in str(m) for m in msgs)@patch("myapp.rag_chain.client.chat.completions.create")def test_chain_raises_on_empty_context(mock_create): chain = RAGChain() with pytest.raises(ValueError, match="empty context"): chain.run("question", context="")
Run: pytest tests/unit/ -v — zero API costs, results in < 5 seconds.
Layer 2: Integration tests — golden dataset as the oracle
Integration tests call the real model with a set of golden test cases. A golden dataset is a collection of input → quality criterion examples — created once, version-controlled in the repository, extended after every production bug.
How to build a golden dataset?
- 1.Collect 30–100 representative questions — from production logs or domain experts; more examples means better coverage
- 2.Add edge cases — out-of-scope questions, ambiguities, questions in a different language, extremely long contexts
- 3.Define a criterion, not an exact answer — "must contain the number 14", "must not contain the word 'sorry'", faithfulness threshold ≥ 0.80
- 4.Version it like code — store in tests/fixtures/golden.json, every change requires a code review and a comment explaining why
- 5.Extend after every bug — a new production failure becomes a new test case; the dataset gets more accurate over time
# test_integration.pyimport json, pytestfrom deepeval import assert_testfrom deepeval.test_case import LLMTestCasefrom deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetricfrom myapp.rag_chain import RAGChain@pytest.fixture(scope="session")def chain(): return RAGChain()@pytest.mark.parametrize("case", [ {"q": "How many days to return?", "ctx": "Returns: 14 days.", "exp": "14"}, {"q": "Can I return a used product?", "ctx": "New products only.", "exp": "new"}, {"q": "Return shipping cost?", "ctx": "Free returns on orders over $50.", "exp": "50"},])def test_golden_set(case, chain): answer = chain.run(case["q"], context=case["ctx"]) test_case = LLMTestCase( input=case["q"], actual_output=answer, retrieval_context=[case["ctx"]] ) assert_test(test_case, [ AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.8) ]) assert case["exp"].lower() in answer.lower()
Cost: 100 cases × 500 tokens × $0.15/1M = ~$0.008 per run.
Layer 3: Evaluation — when the criterion is "good", not "correct"
For complex outputs (summaries, generated content, multi-turn conversations) we use LLM-as-judge: a second model evaluates output quality according to criteria. DeepEval implements G-Eval — a method achieving 0.85+ correlation with human evaluation.
Important rule: the judge should come from a different model family than the model being evaluated — if production uses GPT-4o, the judge should be Claude or vice versa. This eliminates self-enhancement bias (the same model rates its own output 10–25% higher than it should).
# test_eval.pyimport pytestfrom deepeval import assert_testfrom deepeval.test_case import LLMTestCase, LLMTestCaseParamsfrom deepeval.metrics import GEvalfrom myapp.summarizer import summarize_documentACCURACY = GEval( name="Accuracy", criteria="Is the summary factually consistent with the document and free of information not present in it?", evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT], threshold=0.7)CONCISENESS = GEval( name="Conciseness", criteria="Is the summary concise and free of repeated information?", evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT], threshold=0.6)@pytest.mark.parametrize("doc,min_len,max_len", [ ("Q1 report: revenue up 15%. Costs stable. Margin 32%.", 20, 200), ("Returns policy: 14 days, new products only, shipping at customer expense.", 15, 150),])def test_summarizer_quality(doc, min_len, max_len): summary = summarize_document(doc) assert min_len <= len(summary) <= max_len assert_test( LLMTestCase(input=doc, actual_output=summary), [ACCURACY, CONCISENESS] )
/// CI/CD Z BRAMKĄ JAKOŚCI AI
Unit na każdym commicie — ewaluacja tylko przy zmianie promptów
paths: [myapp/prompts/**, myapp/chain.py, tests/fixtures/**]
How to configure CI/CD — a quality gate before deployment?
Goal: every PR that changes prompts or AI logic must pass a quality evaluation. If results fall below the threshold — merge is automatically blocked.
Two stages in the pipeline: - unit tests — run on every PR (< 30 seconds, $0 API cost) - eval gate — runs only when changed files match AI paths (a few minutes, ~$0.01)
name: AI Evaluation Gateon: pull_request: paths: - 'myapp/prompts/**' - 'myapp/chain.py' - 'tests/fixtures/**'jobs: unit-tests: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: {python-version: '3.12'} - run: pip install -r requirements.txt - run: pytest tests/unit/ -v --tb=short eval-gate: needs: unit-tests runs-on: ubuntu-latest env: OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: {python-version: '3.12'} - run: pip install -r requirements.txt deepeval - run: deepeval test run tests/integration/ --min-success-rate 0.85 - run: python tests/regression/check_baseline.py --threshold 0.80
Evaluation results appear directly in PR comments — the developer sees which metrics failed without digging through CI logs.
How to detect regression after a model update?
OpenAI and Anthropic update models without warning. To know when something has changed, you need a baseline — saved evaluation results on the golden dataset tied to a specific model version.
- 1.After every deployment save results as baseline_vN.json with the date and model version
- 2.After any suspected model update run the eval again and compare against the baseline
- 3.Alert when faithfulness or relevance drops by more than 5 percentage points
- 4.If regression is confirmed — roll back to the previous prompt version or block the model update
| Metric | Baseline (May) | After update | Change | Action |
|---|---|---|---|---|
| Faithfulness | 0.87 | 0.79 | −8% | Alert + prompt analysis |
| Answer Relevancy | 0.82 | 0.81 | −1% | OK — within variance |
| Hallucination rate | 3.2% | 4.1% | +0.9% | Monitor for a week |
| Latency p95 | 2.1s | 2.3s | +10% | OK |
| Cost/1k calls | $0.024 | $0.026 | +8% | OK |
Tools — what to choose and when?
| Tool | Type | Best for | Price |
|---|---|---|---|
| DeepEval | Python SDK | Full pyramid: unit + integration + G-Eval + CI | Free / $29+ |
| RAGAS | Python SDK | RAG pipelines: faithfulness, context recall, precision | Free |
| PromptFoo | CLI/YAML | Prompt version comparison, red teaming, A/B | Free / $20+ |
| Evidently | Python + Action | Quality regression, dashboards, GitHub Actions | Free / $95+ |
| LangSmith | SaaS | LangChain + eval + tracing in one place | Free / $39+ |
Common mistakes — what not to do
- Testing the base model instead of your own code — you're checking GPT-4's intelligence, not your code. Test only the logic you wrote yourself
- One giant test instead of a pyramid — if every test calls LLM-as-judge, you pay $0.05 per commit and CI takes 20 minutes
- Golden dataset without edge cases — 50 happy path examples will catch 30% of bugs. Add 20% hard questions and 20% out-of-scope cases
- No dataset versioning — a changed golden.json without code review is a recipe for false positives
- Exact content comparison — assert result == "14 days" fails when the model says "two weeks". Use assert "14" in result or DeepEval metrics
- Ignoring evaluation cost — 1000 cases × LLM-as-judge = $1–5 per run. Budget it and run full evaluation on merge, not on every commit
Checklist
- 1.Unit tests mock the LLM — zero API cost, deterministic, run on every commit
- 2.Golden dataset: 30–100 cases including edge cases and production bugs, versioned like code
- 3.Integration tests use DeepEval or RAGAS with thresholds — faithfulness threshold ≥ 0.75
- 4.G-Eval evaluation for complex outputs — judge from a different model family than production
- 5.Baseline after every deployment, alert when metrics drift by more than 5 percentage points
- 6.CI/CD: unit on every PR, eval gate only on AI changes — paths filter in GitHub Actions
- 7.Evaluation cost monitored and budgeted: < $0.05 per merge to main
- 8.Results visible in PR comments — no digging through CI logs required
---
I build testing systems for AI applications — from the unit test pyramid through golden datasets to CI/CD with a quality gate. If your LLM app is in production without a test suite, get in touch — I start with a code audit and designing the minimal sensible pyramid.
/// RELATED_RECORDS
How AI Reads Invoices from Email and Enters Them into ERP
AI can automatically read an invoice from an email attachment — PDF, scan, or phone photo — and enter the data directly into an ERP system without any manual retyping. Full automation of cost invoice processing: from the mailbox to accounting.
Where to Start with AI Implementation in Your Company
AI implementation starts not with choosing a tool, but with identifying one repetitive process that wastes the most human time. Learn step by step how to select, map, and automate that process.
How to Build a Company Internal Knowledge Base with AI (RAG in Practice)
An internal knowledge base built on RAG lets you create your own company chatbot that answers only from your company's documents — not the model's guesses. Safe, up-to-date, precise AI with full control over your data.
Signal received?
Terminate
Silence
Initiate protocol. Establish connection. Let's build something loud.
