What is the minimum viable LLM monitoring setup?

Three elements: (1) structured log of each call with trace_id, model, tokens, latency — one JSON line, easy to add; (2) cost alerts — daily budget exceeded; (3) error rate monitoring — if > 5% in 10 minutes, get a notification. Add quality evaluation (LLM-judge) in week 2, business metrics in month 2. Start with the minimum, extend gradually.

How much does LLM-judge evaluation cost?

For 5% sampling of 10,000 daily calls: 500 evaluations × ~800 tokens (question + context + answer + judge prompt) = 400,000 tokens/day. On gpt-4o-mini at $0.15/1M: **~$0.06/day**. For 100,000 daily calls: ~$0.60/day. Evaluation cost is 1–2% of production costs — an excellent investment in visibility.
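The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch: the function name and parameters are illustrative, and like the estimate in the answer it counts input tokens only, so real bills will be somewhat higher once judge output tokens are included.

```python
def judge_cost_per_day(daily_calls: int,
                       sample_rate: float = 0.05,
                       tokens_per_eval: int = 800,
                       usd_per_1m_input: float = 0.15) -> float:
    """Estimated daily LLM-judge cost in USD, counting input tokens only."""
    evaluations = daily_calls * sample_rate
    return evaluations * tokens_per_eval * usd_per_1m_input / 1_000_000


print(round(judge_cost_per_day(10_000), 2))   # 0.06
print(round(judge_cost_per_day(100_000), 2))  # 0.6
```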

Does monitoring affect latency?

Structured logging is synchronous but adds under 1 ms (a single logger.info call). LLM-judge evaluation is fully asynchronous: it runs in a background task after the response has been sent to the user. Alerting checks run in a background thread every 60 seconds. The net effect on user-perceived latency is negligible.
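A minimal sketch of the "respond first, evaluate later" pattern, assuming a thread pool is acceptable in your deployment. The names respond_then_evaluate, evaluate and on_result are hypothetical; in practice evaluate would wrap the judge call and on_result would store the score.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

logger = logging.getLogger(__name__)

# A small dedicated pool so judge calls never compete with request handling.
_judge_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="llm-judge")


def respond_then_evaluate(answer: str,
                          evaluate: Callable[[str], object],
                          on_result: Callable[[object], None]) -> tuple[str, Future]:
    """Return the answer immediately and score it on a worker thread."""
    def _job() -> None:
        try:
            on_result(evaluate(answer))
        except Exception:
            # A failed evaluation is logged but never reaches the user-facing path.
            logger.exception("background evaluation failed")

    return answer, _judge_pool.submit(_job)
```

The caller sends the returned answer to the user right away; the Future is only needed if something (a test, a shutdown hook) wants to wait for the evaluation to finish.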

How do I detect a model update by OpenAI?

Three methods: (1) log model version from each response — OpenAI sometimes updates it silently; (2) track quality metric trends — a sudden faithfulness or completeness drop often signals a model change; (3) subscribe to OpenAI status page and changelog. After any suspected change, run a baseline comparison on 1,000 representative queries from the previous week.
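Method (1) can be sketched as a small watcher that compares the model identifier reported in each response with the last one seen for the same requested alias. The class name is hypothetical, and the sketch assumes you pass in the response's model field (for example resp.model in the OpenAI Python SDK) on every call.

```python
from typing import Optional


class ModelVersionWatcher:
    """Remembers the last reported model identifier per requested alias."""

    def __init__(self) -> None:
        self._last_seen: dict[str, str] = {}

    def observe(self, requested: str, reported: str) -> Optional[tuple[str, str]]:
        """Return (old, new) when the reported version changes, else None."""
        previous = self._last_seen.get(requested)
        self._last_seen[requested] = reported
        if previous is not None and previous != reported:
            return previous, reported
        return None
```

A non-None return value is the signal to trigger the baseline comparison described above.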


2026-06-06 · AI & Automation · 15 min

Monitoring AI in Production: Metrics, Tracing and Alerts for LLM Applications

Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

LLM monitoring in production means tracking not just infrastructure metrics (CPU, latency, error rate) but the semantic quality of every response: hallucination rate, context loss, topic drift after a model update. An HTTP 200 with a hallucination is worse than a timeout with an error code — classical monitoring can't tell the difference. Without LLM-specific observability, quality degradation is invisible until a user reports it.

How to monitor an LLM application so you know about failures before your users do. Three layers of metrics, distributed tracing, automated quality evaluation and alerting systems — with real Python code.

Six months after launch, your AI assistant is handling 2,000 queries a day. One morning a user writes: "your bot has been giving nonsense for two days." You check the logs — indeed, since a model update three days ago, the RAG pipeline has been returning answers without context. Two days of silent failures, hundreds of bad answers, users losing confidence.

This is the standard scenario for teams that deploy LLM applications without monitoring. Classic infrastructure monitoring (CPU, RAM, p95 latency) doesn't catch the core problems of AI systems: hallucinations, context loss, quality degradation after a model update.

How LLM monitoring differs from classic monitoring

Classic monitoring asks: is the server working? LLM monitoring asks: is the model answering correctly?

The differences that matter:
- Latency of 30 seconds can be acceptable (complex reasoning) or a sign of a loop; context decides
- An HTTP 200 response can contain a complete hallucination; the status code tells you nothing about quality
- Quality degrades gradually: not a sudden crash but a slow drift you won't catch without trend analysis
- Costs scale with traffic; without monitoring, a single bad prompt can send costs skyward
- Model updates (by OpenAI, without notice) change behaviour; you need a baseline to detect drift

Three layers of AI monitoring

Layer 1: Infrastructure

The foundation; without this, the rest doesn't work:
- Latency p50/p95/p99: separately for each endpoint and model
- Error rate: HTTP 4xx/5xx, timeouts, rate limit hits
- Throughput: queries per minute, trend over time
- Token cost: input + output + cached, per feature
- Rate limit proximity: how close to the OpenAI limit
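Rate limit proximity can be derived from the x-ratelimit-* headers that the OpenAI API returns with each response. The sketch below is a hedged example: the function name is illustrative, and how you obtain the headers depends on your client (in the OpenAI Python SDK, the with_raw_response variant of a call exposes them).

```python
from typing import Mapping, Optional


def rate_limit_usage(headers: Mapping[str, str], kind: str = "requests") -> Optional[float]:
    """Fraction of the current rate-limit window already used (0.0 to 1.0).

    kind is "requests" or "tokens". Returns None when the headers are
    missing or malformed, so callers can skip the data point.
    """
    try:
        limit = int(headers[f"x-ratelimit-limit-{kind}"])
        remaining = int(headers[f"x-ratelimit-remaining-{kind}"])
    except (KeyError, ValueError):
        return None
    if limit <= 0:
        return None
    return 1 - remaining / limit
```

A value above 0.8 would correspond to an 80% proximity alert.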

Layer 2: LLM Quality

The most important layer, most often omitted:
- Faithfulness: does the answer follow from the provided context?
- Answer relevance: does the answer address the question?
- Hallucination rate: percentage of answers with invented facts
- Completeness: does the answer cover all aspects of the question?
- Toxicity: harmful or inappropriate content

Layer 3: Business Impact

Ties technical quality to real value:
- Task completion rate: did the user achieve their goal?
- Satisfaction signal: thumbs up/down, explicit ratings
- Escalation rate: how often does the bot hand off to a human?
- Conversion: did the AI interaction lead to the desired action?
- Session abandonment: did the user leave mid-conversation?
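As a rough illustration, three of these rates can be computed from per-session flags. The Session record and its field names are hypothetical; how a session gets marked as completed, escalated or abandoned depends entirely on your product.

```python
from dataclasses import dataclass


@dataclass
class Session:
    completed_task: bool      # user reached their goal
    escalated_to_human: bool  # bot handed off to a person
    abandoned: bool           # user left mid-conversation


def business_rates(sessions: list[Session]) -> dict[str, float]:
    """Share of sessions (0.0 to 1.0) for each Layer 3 signal."""
    n = len(sessions)
    if n == 0:
        return {"task_completion": 0.0, "escalation": 0.0, "abandonment": 0.0}
    return {
        "task_completion": sum(s.completed_task for s in sessions) / n,
        "escalation": sum(s.escalated_to_human for s in sessions) / n,
        "abandonment": sum(s.abandoned for s in sessions) / n,
    }
```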

Figure: three layers of AI monitoring, from infrastructure to business impact

Layer	Metric	Note
01 Infrastructure & API (monitor like any software)	Latency p50 / p95 / p99	end-to-end + TTFT
01 Infrastructure & API	Error rate (API + validation)	alert > 5%
01 Infrastructure & API	Throughput (req/s, req/day)	per endpoint
01 Infrastructure & API	Token cost (input + output)	daily trend
01 Infrastructure & API	Rate limit proximity	alert > 80%
02 LLM answer quality (requires evaluation, not code)	Faithfulness	is it grounded in context
02 LLM answer quality	Answer relevance	does it answer the question
02 LLM answer quality	Hallucination rate	alert > 10%
02 LLM answer quality	Completeness	is the answer complete
02 LLM answer quality	Toxicity / Safety	alert > 0.5%
03 Business impact (the ultimate measure of AI success)	Task completion rate	goal achieved?
03 Business impact	Thumbs up / down ratio	direct feedback
03 Business impact	Escalation rate	alert < 2% or > 25%
03 Business impact	Conversion	purchase, sign-up
03 Business impact	Session abandonment	alert > 40%

Summary: 3 monitoring layers, 15+ key metrics, <$2 evaluation per day (10K calls).

Tracing — every call logged

The core of production monitoring is structured logging of every LLM call. Each trace should contain:

1. trace_id: unique call identifier, passed through the entire system
2. model: which model, which version
3. input_tokens / output_tokens: for cost calculation
4. cached_tokens: how many tokens came from cache
5. latency_ms: end-to-end response time
6. prompt_hash: identifies recurring queries
7. cost_usd: calculated cost per call
8. user_id: for per-user analysis
9. feature: which part of the application generated the call

Implementation in Python

llm_tracer.py

```python
# llm_tracer.py
import hashlib
import json
import logging
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Optional

from openai import OpenAI

logger = logging.getLogger(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# USD per 1M tokens as (input, output). Check against current pricing.
COST_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "o3-mini": (1.10, 4.40),
}


@dataclass
class LLMTrace:
    trace_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    prompt_hash: str
    cost_usd: float
    error: Optional[str] = None
    user_id: Optional[str] = None
    feature: Optional[str] = None


def traced_call(messages, model="gpt-4o-mini", trace_id=None,
                user_id=None, feature=None) -> tuple[str, LLMTrace]:
    trace_id = trace_id or str(uuid.uuid4())
    # Hash the whole message list so identical requests share a hash.
    prompt_hash = hashlib.sha256(
        json.dumps(messages, sort_keys=True, default=str).encode()
    ).hexdigest()[:12]
    t0 = time.perf_counter()
    try:
        resp = client.chat.completions.create(model=model, messages=messages)
        latency_ms = (time.perf_counter() - t0) * 1000
        usage = resp.usage
        # Cached tokens are reported under usage.prompt_tokens_details.
        details = getattr(usage, "prompt_tokens_details", None)
        cached = getattr(details, "cached_tokens", 0) or 0
        # Unknown models are logged with zero cost instead of raising KeyError.
        in_p, out_p = COST_PER_1M.get(model, (0.0, 0.0))
        cost = (usage.prompt_tokens * in_p + usage.completion_tokens * out_p) / 1_000_000
        # resp.model carries the exact version the API served.
        trace = LLMTrace(trace_id, resp.model, usage.prompt_tokens,
                         usage.completion_tokens, cached,
                         latency_ms, prompt_hash, cost,
                         user_id=user_id, feature=feature)
        # json.dumps keeps the log entry a single valid JSON object.
        logger.info("llm_trace %s", json.dumps(asdict(trace)))
        return resp.choices[0].message.content, trace
    except Exception as e:
        trace = LLMTrace(trace_id, model, 0, 0, 0,
                         (time.perf_counter() - t0) * 1000, prompt_hash, 0.0,
                         error=str(e), user_id=user_id, feature=feature)
        logger.error("llm_trace_error %s", json.dumps(asdict(trace)))
        raise
```

Output: One JSON line per call — ready for CloudWatch, Datadog, Grafana Loki or BigQuery.
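Once traces are stored as JSON lines, a cost-per-feature view is a short aggregation. The sketch below assumes each log line contains one JSON object with feature and cost_usd fields, possibly preceded by a text prefix such as llm_trace; the function name is illustrative.

```python
import json
from collections import defaultdict
from typing import Iterable


def cost_per_feature(log_lines: Iterable[str]) -> dict[str, float]:
    """Sum cost_usd per feature; lines without a JSON object are skipped."""
    totals: dict[str, float] = defaultdict(float)
    for line in log_lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            trace = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        # Calls logged without a feature are grouped under "unknown".
        feature = trace.get("feature") or "unknown"
        totals[feature] += float(trace.get("cost_usd") or 0.0)
    return dict(totals)
```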

Quality evaluation — LLM as judge

Automated quality evaluation works in a sampling model: you evaluate 5–10% of production calls using a second LLM as judge.

Metric	Scale	What it measures	Alert threshold
Faithfulness	1–5	Does the answer follow from the context?	< 3.5
Relevance	1–5	Does it address the question?	< 3.0
Completeness	1–5	Does it cover all aspects?	< 3.0
Hallucination	boolean	Invented facts detected?	> 5% calls
Toxicity	boolean	Harmful content?	> 0% calls
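The thresholds in the table can be applied to a batch of judge results roughly as follows. This is a sketch under assumptions: results are plain dicts, and the toxicity_detected key is hypothetical (treated as False when absent).

```python
from statistics import mean

# Minimum acceptable averages on the 1-5 scale, taken from the table above.
SCORE_FLOORS = {"faithfulness": 3.5, "relevance": 3.0, "completeness": 3.0}
MAX_HALLUCINATION_RATE = 0.05


def breached_thresholds(evals: list[dict]) -> list[str]:
    """Names of metrics whose aggregate breaches its alert threshold."""
    if not evals:
        return []
    breached = [metric for metric, floor in SCORE_FLOORS.items()
                if mean(e[metric] for e in evals) < floor]
    hallucination_rate = sum(bool(e["hallucination_detected"]) for e in evals) / len(evals)
    if hallucination_rate > MAX_HALLUCINATION_RATE:
        breached.append("hallucination")
    # Any toxic answer at all counts as a breach (threshold: > 0% of calls).
    if any(e.get("toxicity_detected", False) for e in evals):
        breached.append("toxicity")
    return breached
```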

llm_judge.py

```python
# llm_judge.py
from typing import Optional

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# instructor patches the client so responses are parsed into a Pydantic model.
ic = instructor.from_openai(OpenAI())


class EvalResult(BaseModel):
    faithfulness: int = Field(ge=1, le=5, description="1=contradicts context, 5=fully grounded")
    relevance: int = Field(ge=1, le=5, description="1=off-topic, 5=fully answers question")
    completeness: int = Field(ge=1, le=5, description="1=incomplete, 5=covers all aspects")
    hallucination_detected: bool
    issues: Optional[list[str]] = None


def evaluate_rag_response(question: str, context: str, answer: str) -> EvalResult:
    # Blank lines keep the three sections from running together in the prompt.
    user_content = (
        "QUESTION: " + question
        + "\n\nCONTEXT: " + context
        + "\n\nAI ANSWER TO EVALUATE: " + answer
    )
    return ic.chat.completions.create(
        model="gpt-4o-mini",
        response_model=EvalResult,
        messages=[
            {"role": "system", "content": "You are an expert evaluator of AI assistant responses. Be strict and objective."},
            {"role": "user", "content": user_content},
        ],
    )
```
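One way to pick the 5–10% of calls that go to the judge is to hash the trace_id instead of drawing a random number. This is a swapped-in technique, not necessarily how the sampling is done in the setup described here: it gives roughly the same sample size, but the decision is reproducible, so the same call is always in or out of the sample. The function name is illustrative.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select about sample_rate of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```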

Figure: LLM quality evaluation pipeline, from production logs to a quality alert

LLM calls (each one logged) → Sampling 5% (random sample) → LLM-judge (GPT-4o mini scores) → Aggregation (average, trends) → Alert / Dashboard (if below threshold)

Offline, not real-time. Evaluation runs on a sample of logs (nightly or hourly job). For 10,000 calls/day → 500 evaluations × ~$0.002 = $1/day for full quality monitoring.

Key figures: 0.85+ correlation with human rating; ~$0.002 cost per evaluation.

Alerts — what to watch and what threshold to set

Hard alerts (immediate action required)

- Error rate > 5% in 5-minute window: likely model or API failure
- p95 latency > 30s: likely token generation loop or overloaded context
- Faithfulness < 3.5 average in 60-minute window: RAG quality degradation
- Cost growth > 300% vs. previous day: prompt injection or runaway loop
- Rate limit hit rate > 20%: need to implement throttling or increase limits
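The cost-growth alert needs a precise definition of "growth". The sketch below assumes growth means the day-over-day increase relative to yesterday's spend, so 300% corresponds to today costing more than four times yesterday; if you read the threshold as "three times yesterday", adjust accordingly. Function names are illustrative.

```python
def cost_growth_percent(today_usd: float, yesterday_usd: float) -> float:
    """Day-over-day cost growth in percent; infinite if yesterday was zero."""
    if yesterday_usd <= 0:
        return float("inf") if today_usd > 0 else 0.0
    return (today_usd - yesterday_usd) / yesterday_usd * 100


def cost_alert(today_usd: float, yesterday_usd: float,
               max_growth_percent: float = 300.0) -> bool:
    """True when growth exceeds the threshold (strictly greater than)."""
    return cost_growth_percent(today_usd, yesterday_usd) > max_growth_percent
```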

Soft alerts (monitor, don't wake at night)

- p50 latency growth trend: gradual increase in load on the model
- Hallucination rate > 3%: requires prompt review, not immediate intervention
- Completion rate drop > 10% vs. weekly baseline: possible UX or quality issue
- Cache hit rate drop: context structure may have changed
- Error spike for a specific user: could indicate abuse or an edge case

alerting.py

```python
# alerting.py
import threading
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass
class Alert:
    level: str
    metric: str
    value: float
    threshold: float
    message: str


class MetricBuffer:
    """Sliding time window of values, safe to share between threads."""

    def __init__(self, window_minutes=10):
        self.window = timedelta(minutes=window_minutes)
        self._data: deque = deque()
        self._lock = threading.Lock()

    def _prune(self, now):
        # Caller must hold the lock.
        while self._data and (now - self._data[0][0]) > self.window:
            self._data.popleft()

    def add(self, value: float):
        now = datetime.now(timezone.utc)
        with self._lock:
            self._data.append((now, value))
            self._prune(now)

    def _values(self):
        # Prune on reads too, so stale data ages out when traffic stops.
        with self._lock:
            self._prune(datetime.now(timezone.utc))
            return [v for _, v in self._data]

    def mean(self):
        vals = self._values()
        return sum(vals) / len(vals) if vals else 0

    def p95(self):
        vals = sorted(self._values())
        return vals[int(len(vals) * 0.95)] if vals else 0

    def count(self):
        return len(self._values())


latency = MetricBuffer(5)        # milliseconds per call
error_rate = MetricBuffer(5)     # add 1.0 for a failed call, 0.0 for a success
faithfulness = MetricBuffer(60)  # judge scores on the 1-5 scale


def check_alerts(notify: Callable[[Alert], None]):
    if error_rate.count() > 10 and error_rate.mean() > 0.05:
        notify(Alert("critical", "error_rate", error_rate.mean(), 0.05, "Error rate exceeded 5%"))
    p95 = latency.p95()
    if p95 > 30_000:
        notify(Alert("critical", "latency_p95", p95, 30_000, "p95 latency > 30s"))
    # Require data first: an empty buffer reports a mean of 0.
    if faithfulness.count() > 0 and faithfulness.mean() < 3.5:
        notify(Alert("warning", "faithfulness", faithfulness.mean(), 3.5, "RAG quality degraded"))
```

Run check_alerts every 60 seconds in a background thread — with Slack, PagerDuty or email as the notify target.
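A minimal sketch of such a background loop, using only the standard library. The helper name start_periodic is hypothetical; the task you pass in would typically be a lambda that calls check_alerts with your own notify function.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)


def start_periodic(task: Callable[[], None], interval_s: float = 60.0) -> threading.Event:
    """Run task every interval_s seconds on a daemon thread.

    Returns an Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def _loop() -> None:
        # Event.wait acts as an interruptible sleep: it returns True once
        # stop is set and False when the timeout elapses.
        while not stop.wait(interval_s):
            try:
                task()
            except Exception:
                # One failing check must not kill the monitoring loop.
                logger.exception("periodic task failed")

    threading.Thread(target=_loop, name="alert-checker", daemon=True).start()
    return stop
```

Because the thread is a daemon, it does not block interpreter shutdown; the first run happens one interval after start.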

A/B testing prompts

Comparing prompt versions is standard practice for production AI systems. A framework in 5 steps:

1. Define the hypothesis: "adding chain-of-thought will improve faithfulness by 0.3 points"
2. Split traffic: e.g. 90% control, 10% variant (small sample first)
3. Collect evaluations: LLM-judge on both groups, minimum 500 calls per group
4. Statistical test: Mann-Whitney U (non-normal distributions are typical in AI quality metrics)
5. Make a decision: if p < 0.05 and improvement > minimal threshold, roll out; else revert

Don't test more than two variants at once — it complicates analysis and lengthens the time to a statistically significant result.
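Step 4 can be sketched in pure Python. The version below uses the normal approximation with a tie correction and no continuity correction, which is a reasonable assumption at 500 scores per group but not for very small samples; judge scores on a 1–5 scale produce many ties, so the tie correction matters. For production analysis an established implementation such as scipy.stats.mannwhitneyu is likely the safer choice.

```python
import math
from collections import Counter


def mann_whitney_u(control: list[float], variant: list[float]) -> tuple[float, float]:
    """Return (U statistic for control, two-sided p-value)."""
    n1, n2 = len(control), len(variant)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one score")
    counts = Counter(control + variant)
    # Assign each distinct value the average of the ranks it occupies.
    rank_of: dict[float, float] = {}
    position = 1
    for value in sorted(counts):
        rank_of[value] = position + (counts[value] - 1) / 2
        position += counts[value]
    rank_sum = sum(rank_of[v] for v in control)
    u1 = rank_sum - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction shrinks the variance when many scores are equal.
    tie_term = sum(c**3 - c for c in counts.values()) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    if variance == 0:
        return u1, 1.0  # every score identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))
```

Calling mann_whitney_u(control_scores, variant_scores) on the faithfulness scores of both groups gives the p-value used in step 5.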

Production monitoring checklist

1. Structured logging of every LLM call with trace_id, model, tokens, latency, cost
2. Sampling 5–10% of calls for LLM-judge quality evaluation
3. Dashboards: latency p50/p95/p99, error rate, cost per feature
4. Hard alerts: error rate > 5%, p95 > 30s, faithfulness < 3.5
5. Soft alerts: latency trends, cache hit rate, completion rate
6. Post-deployment baselines after each model update
7. Automatic A/B testing framework for prompt changes
8. Cost alerts: daily budget, anomaly detection
9. Business metrics correlated with technical quality
10. Weekly review of quality metric trends

Tool	Type	Best for	Cost
Helicone	Proxy	Quick start, zero code changes	Free / $20+
LangSmith	SDK	LangChain teams	Free / $39+
Datadog LLM	APM	Enterprise, existing Datadog	$$$
Langfuse	Open source	Full control, self-hosted	Free
Custom (as above)	Custom	Complete flexibility	Infrastructure cost

---

I build AI application monitoring systems for production — from distributed tracing and automated quality evaluation to dashboards and alerting. If your LLM app is already in production but you don't know what's happening inside it, get in touch — I start with an audit of your logs and a baseline metrics setup.



