Monitoring AI in Production: Metrics, Tracing and Alerts for LLM Applications
How to monitor an LLM application so you know about failures before your users do. Three layers of metrics, distributed tracing, automated quality evaluation and alerting systems — with real Python code.
Six months after launch, your AI assistant is handling 2,000 queries a day. One morning a user writes: "your bot has been giving nonsense for two days." You check the logs — indeed, since a model update three days ago, the RAG pipeline has been returning answers without context. Two days of silent failures, hundreds of bad answers, users losing confidence.
This is the standard scenario for teams that deploy LLM applications without monitoring. Classic infrastructure monitoring (CPU, RAM, p95 latency) doesn't catch the core problems of AI systems: hallucinations, context loss, quality degradation after a model update.
How LLM monitoring differs from classic monitoring
Classic monitoring asks: is the server working? LLM monitoring asks: is the model answering correctly?
The differences that matter:
- A latency of 30 seconds can be acceptable (complex reasoning) or a sign of a loop; context decides
- An HTTP 200 response can contain a complete hallucination: the status code tells you nothing about quality
- Quality degrades gradually: not a sudden crash but a slow drift you won't catch without trend analysis
- Costs scale with traffic: without monitoring, a single bad prompt can send costs skyward
- Model updates (by OpenAI, without notice) change behaviour: you need a baseline to detect drift
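Gradual drift is easiest to see as a trend in a daily quality score. The sketch below assumes you already have one average score per day; the function names and the 0.3-point tolerance are illustrative choices, not part of any library.

```python
from statistics import mean


def trend_slope(daily_scores: list[float]) -> float:
    """Least-squares slope of the scores against the day index (points per day)."""
    n = len(daily_scores)
    if n < 2:
        raise ValueError("need at least two daily scores")
    x_bar = (n - 1) / 2
    y_bar = mean(daily_scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(daily_scores))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def drifted(baseline: list[float], recent: list[float], max_drop: float = 0.3) -> bool:
    """True when the recent mean is more than max_drop below the baseline mean."""
    if not baseline or not recent:
        raise ValueError("both samples must be non-empty")
    return mean(baseline) - mean(recent) > max_drop
```

A negative slope over several days, or a `drifted(...)` result of `True` against a baseline recorded before a model update, is a reason to look at individual traces.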
Three layers of AI monitoring
Layer 1: Infrastructure
The foundation; without this, the rest doesn't work:
- Latency p50/p95/p99: separately for each endpoint and model
- Error rate: HTTP 4xx/5xx, timeouts, rate limit hits
- Throughput: queries per minute, trend over time
- Token cost: input + output + cached, per feature
- Rate limit proximity: how close you are to the OpenAI limit
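As a rough illustration of the latency item above, percentiles can be computed per endpoint and model with the standard library alone. The record layout `(endpoint, model, latency_ms)` is an assumption made for this sketch.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 for a list of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def percentiles_by_endpoint_and_model(records):
    """Group (endpoint, model, latency_ms) records and summarise each group."""
    grouped = defaultdict(list)
    for endpoint, model, latency_ms in records:
        grouped[(endpoint, model)].append(latency_ms)
    # Groups with a single sample are skipped: percentiles are meaningless there.
    return {key: latency_percentiles(vals)
            for key, vals in grouped.items() if len(vals) >= 2}
```

In production you would normally let your metrics backend compute these; the sketch only shows what the numbers mean.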
Layer 2: LLM Quality
The most important layer, and the one most often omitted:
- Faithfulness: does the answer follow from the provided context?
- Answer relevance: does the answer address the question?
- Hallucination rate: percentage of answers with invented facts
- Completeness: does the answer cover all aspects of the question?
- Toxicity: harmful or inappropriate content
Layer 3: Business Impact
Ties technical quality to real value:
- Task completion rate: did the user achieve their goal?
- Satisfaction signal: thumbs up/down, explicit ratings
- Escalation rate: how often does the bot hand off to a human?
- Conversion: did the AI interaction lead to the desired action?
- Session abandonment: did the user leave mid-conversation?
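These signals reduce to simple ratios once each session is recorded with a few flags. The `Session` record below is hypothetical; how you decide that a session was "completed" or "abandoned" depends on your product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Hypothetical per-session record; field names are illustrative."""
    completed: bool
    escalated: bool
    abandoned: bool
    thumbs_up: Optional[bool] = None  # None means the user left no rating


def business_metrics(sessions: list[Session]) -> dict[str, Optional[float]]:
    """Compute completion, escalation, abandonment and satisfaction rates."""
    if not sessions:
        raise ValueError("no sessions to summarise")
    n = len(sessions)
    rated = [s for s in sessions if s.thumbs_up is not None]
    return {
        "task_completion_rate": sum(s.completed for s in sessions) / n,
        "escalation_rate": sum(s.escalated for s in sessions) / n,
        "abandonment_rate": sum(s.abandoned for s in sessions) / n,
        # Satisfaction is computed over rated sessions only.
        "satisfaction": sum(s.thumbs_up for s in rated) / len(rated) if rated else None,
    }
```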
Figure: Three layers of AI monitoring, from infrastructure to business impact.
Tracing — every call logged
The core of production monitoring is structured logging of every LLM call. Each trace should contain:
1. trace_id: unique call identifier, passed through the entire system
2. model: which model, which version
3. input_tokens / output_tokens: for cost calculation
4. cached_tokens: how many tokens came from cache
5. latency_ms: end-to-end response time
6. prompt_hash: identifies recurring queries
7. cost_usd: calculated cost per call
8. user_id: for per-user analysis
9. feature: which part of the application generated the call
Implementation in Python
```python
# llm_tracer.py
import hashlib
import json
import logging
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Optional

from openai import OpenAI

logger = logging.getLogger(__name__)
client = OpenAI()

# USD per 1M tokens as (input, output). Verify against current provider pricing.
COST_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "o3-mini": (1.10, 4.40),
}


@dataclass
class LLMTrace:
    trace_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    prompt_hash: str
    cost_usd: float
    error: Optional[str] = None
    user_id: Optional[str] = None
    feature: Optional[str] = None


def traced_call(messages, model="gpt-4o-mini", trace_id=None,
                user_id=None, feature=None) -> tuple[str, LLMTrace]:
    trace_id = trace_id or str(uuid.uuid4())
    # Hash of the first message only (typically the system prompt).
    prompt_hash = hashlib.sha256(str(messages[0]).encode()).hexdigest()[:12]
    t0 = time.perf_counter()
    try:
        resp = client.chat.completions.create(model=model, messages=messages)
        latency_ms = (time.perf_counter() - t0) * 1000
        usage = resp.usage
        # Cached tokens are reported under usage.prompt_tokens_details,
        # not directly on the usage object.
        details = getattr(usage, "prompt_tokens_details", None)
        cached = getattr(details, "cached_tokens", 0) or 0
        # Unknown models are traced with zero cost instead of raising KeyError
        # after a successful call.
        in_p, out_p = COST_PER_1M.get(model, (0.0, 0.0))
        # Cached input tokens are billed at the full input rate here.
        cost = (usage.prompt_tokens * in_p + usage.completion_tokens * out_p) / 1_000_000
        trace = LLMTrace(trace_id, model, usage.prompt_tokens, usage.completion_tokens,
                         cached, latency_ms, prompt_hash, cost,
                         user_id=user_id, feature=feature)
        # json.dumps gives one JSON object per log line.
        logger.info(json.dumps({"event": "llm_trace", **asdict(trace)}))
        return resp.choices[0].message.content, trace
    except Exception as e:
        trace = LLMTrace(trace_id, model, 0, 0, 0,
                         (time.perf_counter() - t0) * 1000, prompt_hash, 0.0,
                         error=str(e), user_id=user_id, feature=feature)
        logger.error(json.dumps({"event": "llm_trace_error", **asdict(trace)}))
        raise
```
Output: One JSON line per call — ready for CloudWatch, Datadog, Grafana Loki or BigQuery.
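Once traces are stored as JSON lines, per-feature cost is a small aggregation. This sketch assumes each line is a JSON object carrying the `feature` and `cost_usd` fields listed above; the function name is illustrative.

```python
import json
from collections import defaultdict


def cost_per_feature(log_lines) -> dict[str, float]:
    """Sum cost_usd per feature from an iterable of JSON log lines."""
    totals = defaultdict(float)
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        feature = record.get("feature") or "unknown"
        totals[feature] += float(record.get("cost_usd", 0.0))
    return dict(totals)
```

The same loop works on a file handle, so `cost_per_feature(open("traces.jsonl"))` would summarise a local export (the file name is only an example).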
Quality evaluation — LLM as judge
Automated quality evaluation works in a sampling model: you evaluate 5–10% of production calls using a second LLM as judge.
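One way to pick the 5–10% sample is to hash the trace ID, so the decision is reproducible and needs no shared state between workers. This is a sketch of that idea, not the only option; `should_evaluate` and its default rate are illustrative.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of traces for judging."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the result depends only on the ID, re-running the evaluation job over the same logs selects the same calls.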
| Metric | Scale | What it measures | Alert threshold |
|---|---|---|---|
| Faithfulness | 1–5 | Does the answer follow from the context? | < 3.5 |
| Relevance | 1–5 | Does it address the question? | < 3.0 |
| Completeness | 1–5 | Does it cover all aspects? | < 3.0 |
| Hallucination | boolean | Invented facts detected? | > 5% calls |
| Toxicity | boolean | Harmful content? | > 0% calls |
```python
# llm_judge.py
from typing import Optional

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

ic = instructor.from_openai(OpenAI())


class EvalResult(BaseModel):
    faithfulness: int = Field(ge=1, le=5, description="1=contradicts context, 5=fully grounded")
    relevance: int = Field(ge=1, le=5, description="1=off-topic, 5=fully answers question")
    completeness: int = Field(ge=1, le=5, description="1=incomplete, 5=covers all aspects")
    hallucination_detected: bool
    issues: Optional[list[str]] = None


def evaluate_rag_response(question: str, context: str, answer: str) -> EvalResult:
    # Blank lines between the sections keep question, context and answer
    # from running into each other in the judge prompt.
    user_content = (
        "QUESTION: " + question
        + "\n\nCONTEXT: " + context
        + "\n\nAI ANSWER TO EVALUATE: " + answer
    )
    return ic.chat.completions.create(
        model="gpt-4o-mini",
        response_model=EvalResult,
        messages=[
            {"role": "system",
             "content": "You are an expert evaluator of AI assistant responses. "
                        "Be strict and objective."},
            {"role": "user", "content": user_content},
        ],
    )
```
Figure: LLM quality evaluation pipeline, from production logs to a quality alert.
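The thresholds from the table above can be checked against a batch of judge results in a few lines. The `Judged` record is a stand-in defined here so the sketch runs on its own; the `toxic` flag is an assumption, since toxicity would need its own detector or judge field.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Judged:
    """Stand-in for one judged call; mirrors the metrics in the table."""
    faithfulness: int
    relevance: int
    completeness: int
    hallucination_detected: bool
    toxic: bool = False  # assumed to come from a separate toxicity check


MIN_MEAN = {"faithfulness": 3.5, "relevance": 3.0, "completeness": 3.0}
MAX_HALLUCINATION_RATE = 0.05


def quality_breaches(batch: list[Judged]) -> dict[str, float]:
    """Return {metric: observed value} for every threshold the batch violates."""
    if not batch:
        return {}
    breaches = {}
    for name, minimum in MIN_MEAN.items():
        observed = mean(getattr(r, name) for r in batch)
        if observed < minimum:
            breaches[name] = observed
    hallucination_rate = sum(r.hallucination_detected for r in batch) / len(batch)
    if hallucination_rate > MAX_HALLUCINATION_RATE:
        breaches["hallucination_rate"] = hallucination_rate
    toxicity_rate = sum(r.toxic for r in batch) / len(batch)
    if toxicity_rate > 0:  # any toxic answer counts as a breach
        breaches["toxicity_rate"] = toxicity_rate
    return breaches
```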
Alerts — what to watch and what threshold to set
Hard alerts (immediate action required)
- Error rate > 5% in 5-minute window — likely model or API failure
- p95 latency > 30s — likely token generation loop or overloaded context
- Faithfulness < 3.5 average in 60-minute window — RAG quality degradation
- Cost growth > 300% vs. previous day — prompt injection or runaway loop
- Rate limit hit rate > 20% — need to implement throttling or increase limits
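Two of these thresholds are plain ratios rather than rolling-window statistics. A minimal sketch, assuming daily cost totals and a list of recent HTTP status codes are available; the function names are illustrative, and 429 is the standard HTTP status for rate limiting.

```python
def cost_growth_pct(today_usd: float, yesterday_usd: float) -> float:
    """Percentage growth of today's cost over yesterday's."""
    if yesterday_usd <= 0:
        # No baseline: any spend counts as unbounded growth.
        return float("inf") if today_usd > 0 else 0.0
    return (today_usd - yesterday_usd) / yesterday_usd * 100


def rate_limit_hit_rate(status_codes: list[int]) -> float:
    """Share of calls answered with HTTP 429 (Too Many Requests)."""
    if not status_codes:
        return 0.0
    return sum(code == 429 for code in status_codes) / len(status_codes)


def hard_alerts(today_usd: float, yesterday_usd: float,
                status_codes: list[int]) -> list[str]:
    """Return the names of the hard alerts that fire for these inputs."""
    fired = []
    if cost_growth_pct(today_usd, yesterday_usd) > 300:
        fired.append("cost_growth")
    if rate_limit_hit_rate(status_codes) > 0.20:
        fired.append("rate_limit_hit_rate")
    return fired
```

Note that growth of more than 300% means today's cost is more than four times yesterday's.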
Soft alerts (monitor, don't wake at night)
- p50 latency growth trend: may indicate gradually increasing load on the model
- Hallucination rate > 3% — requires prompt review, not immediate intervention
- Completion rate drop > 10% vs. weekly baseline — possible UX or quality issue
- Cache hit rate drop — context structure may have changed
- Specific user error spike — could indicate abuse or edge case
```python
# alerting.py
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass
class Alert:
    level: str
    metric: str
    value: float
    threshold: float
    message: str


class MetricBuffer:
    """Rolling time window of (timestamp, value) samples."""

    def __init__(self, window_minutes=10):
        self.window = timedelta(minutes=window_minutes)
        self._data: deque = deque()

    def _evict(self):
        # Drop samples older than the window, also when nothing new arrives.
        now = datetime.now(timezone.utc)
        while self._data and (now - self._data[0][0]) > self.window:
            self._data.popleft()

    def add(self, value: float):
        self._data.append((datetime.now(timezone.utc), value))
        self._evict()

    def mean(self):
        self._evict()
        return sum(v for _, v in self._data) / len(self._data) if self._data else 0

    def p95(self):
        self._evict()
        if not self._data:
            return 0
        vals = sorted(v for _, v in self._data)
        return vals[int(len(vals) * 0.95)]

    def count(self):
        self._evict()
        return len(self._data)


latency = MetricBuffer(5)        # latency in ms, one sample per call
error_rate = MetricBuffer(5)     # 1.0 for a failed call, 0.0 for a success
faithfulness = MetricBuffer(60)  # judge score 1-5, one sample per evaluated call


def check_alerts(notify: Callable[[Alert], None]):
    if error_rate.count() > 10 and error_rate.mean() > 0.05:
        notify(Alert("critical", "error_rate", error_rate.mean(), 0.05,
                     "Error rate exceeded 5%"))
    if latency.p95() > 30_000:
        notify(Alert("critical", "latency_p95", latency.p95(), 30_000,
                     "p95 latency > 30s"))
    # Require at least one sample: an empty buffer has mean 0 and would
    # otherwise trigger a false alert.
    if faithfulness.count() > 0 and faithfulness.mean() < 3.5:
        notify(Alert("warning", "faithfulness", faithfulness.mean(), 3.5,
                     "RAG quality degraded"))
```
Run check_alerts every 60 seconds in a background thread — with Slack, PagerDuty or email as the notify target.
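A background runner for a periodic check can be built with `threading.Event`, which doubles as the sleep timer and the stop signal. This is a sketch under the assumption that the task is a plain callable; `start_periodic` is an illustrative name.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)


def start_periodic(task: Callable[[], None], interval_s: float = 60.0):
    """Run task every interval_s seconds in a daemon thread.

    Returns (stop_event, thread); call stop_event.set() to end the loop.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set,
        # so the first run happens after one full interval.
        while not stop.wait(interval_s):
            try:
                task()
            except Exception:
                # A failing check must not kill the monitoring thread.
                logger.exception("periodic task failed")

    thread = threading.Thread(target=loop, daemon=True, name="alert-checker")
    thread.start()
    return stop, thread
```

With the alerting module, the call would look like `start_periodic(lambda: check_alerts(notify))`, where `notify` is whatever function posts to your channel.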
A/B testing prompts
Comparing prompt versions is standard practice for production AI systems. A framework in 5 steps:
1. Define the hypothesis: "adding chain-of-thought will improve faithfulness by 0.3 points"
2. Split traffic: e.g. 90% control, 10% variant (small sample first)
3. Collect evaluations: LLM-judge on both groups, minimum 500 calls per group
4. Statistical test: Mann-Whitney U (non-normal distributions are typical for AI quality metrics)
5. Make a decision: if p < 0.05 and the improvement exceeds the minimum threshold, roll out; otherwise revert
Don't test more than two variants at once — it complicates analysis and lengthens the time to a statistically significant result.
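Steps 4 and 5 can be sketched with the standard library. The version below uses the normal approximation with tie correction and no continuity correction, which is reasonable at 500 calls per group; in practice you would more likely call `scipy.stats.mannwhitneyu`. The `decide` helper and its 0.3-point default are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean


def mann_whitney_u(a: list[float], b: list[float]) -> tuple[float, float]:
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample a, p-value).
    """
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for this tie group
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        t = j - i
        tie_term += t**3 - t
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))) if n > 1 else 0.0
    if var <= 0:
        return u_a, 1.0  # all values identical: no evidence of a difference
    z = (u_a - mu) / sqrt(var)
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))


def decide(control: list[float], variant: list[float],
           min_improvement: float = 0.3, alpha: float = 0.05) -> str:
    """Roll out only if the difference is significant and large enough."""
    _, p_value = mann_whitney_u(control, variant)
    improvement = mean(variant) - mean(control)
    return "roll out" if p_value < alpha and improvement > min_improvement else "revert"
```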
Production monitoring checklist
1. Structured logging of every LLM call with trace_id, model, tokens, latency, cost
2. Sampling 5–10% of calls for LLM-judge quality evaluation
3. Dashboards: latency p50/p95/p99, error rate, cost per feature
4. Hard alerts: error rate > 5%, p95 > 30s, faithfulness < 3.5
5. Soft alerts: latency trends, cache hit rate, completion rate
6. Post-deployment baselines after each model update
7. Automatic A/B testing framework for prompt changes
8. Cost alerts: daily budget, anomaly detection
9. Business metrics correlated with technical quality
10. Weekly review of quality metric trends
| Tool | Type | Best for | Cost |
|---|---|---|---|
| Helicone | Proxy | Quick start, zero code changes | Free / $20+ |
| LangSmith | SDK | LangChain teams | Free / $39+ |
| Datadog LLM | APM | Enterprise, existing Datadog | $$$ |
| Langfuse | Open source | Full control, self-hosted | Free |
| Custom (as above) | Custom | Complete flexibility | Infrastructure cost |
---
I build AI application monitoring systems for production — from distributed tracing and automated quality evaluation to dashboards and alerting. If your LLM app is already in production but you don't know what's happening inside it, get in touch — I start with an audit of your logs and a baseline metrics setup.
