AI & Automation 15 min

Monitoring AI in Production: Metrics, Tracing and Alerts for LLM Applications

How to monitor an LLM application so you know about failures before your users do. Three layers of metrics, distributed tracing, automated quality evaluation and alerting systems — with real Python code.

Six months after launch, your AI assistant is handling 2,000 queries a day. One morning a user writes: "your bot has been giving nonsense for two days." You check the logs — indeed, since a model update three days ago, the RAG pipeline has been returning answers without context. Two days of silent failures, hundreds of bad answers, users losing confidence.

This is the standard scenario for teams that deploy LLM applications without monitoring. Classic infrastructure monitoring (CPU, RAM, p95 latency) doesn't catch the core problems of AI systems: hallucinations, context loss, quality degradation after a model update.

How LLM monitoring differs from classic monitoring

Classic monitoring asks: is the server working? LLM monitoring asks: is the model answering correctly?

The differences that matter:

  • A latency of 30 seconds can be acceptable (complex reasoning) or a sign of a loop; context decides
  • An HTTP 200 response can contain a complete hallucination; the status code tells you nothing about quality
  • Quality degrades gradually: not a sudden crash but a slow drift you won't catch without trend analysis
  • Costs scale with traffic: without monitoring, a single bad prompt can send costs skyward
  • Model updates (by OpenAI, without notice) change behaviour; you need a baseline to detect drift
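The "slow drift" point can be made concrete by comparing a recent window of quality scores with a stored baseline. This is a minimal sketch, not a full trend analysis: the function name `quality_drift`, the `max_drop` value of 0.3 and the sample scores are all illustrative assumptions.

```python
from statistics import mean


def quality_drift(baseline_scores, recent_scores, max_drop=0.3):
    """Compare the mean of recent scores with the mean of a stored baseline.

    max_drop is a hypothetical tolerance on the 1-5 judge scale.
    """
    baseline = mean(baseline_scores)
    recent = mean(recent_scores)
    drop = baseline - recent
    return {
        "baseline": baseline,
        "recent": recent,
        "drop": drop,
        "drifted": drop > max_drop,
    }


# Daily mean faithfulness scores (made-up numbers): no single day looks like
# a crash, but the weekly mean has clearly moved.
baseline_week = [4.4, 4.5, 4.3, 4.4, 4.6, 4.4, 4.5]
current_week = [4.3, 4.2, 4.1, 4.0, 3.9, 3.9, 3.8]
result = quality_drift(baseline_week, current_week)
```

Recording a fresh baseline after every deliberate change (new prompt, new model version) keeps the comparison meaningful.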

Three layers of AI monitoring

Layer 1: Infrastructure

The foundation; without this, the rest doesn't work:

  • Latency p50/p95/p99: separately for each endpoint and model
  • Error rate: HTTP 4xx/5xx, timeouts, rate limit hits
  • Throughput: queries per minute, trend over time
  • Token cost: input + output + cached, per feature
  • Rate limit proximity: how close you are to the OpenAI limit
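The Layer 1 numbers can be derived directly from per-call records. A minimal sketch, assuming each record is a dict with hypothetical `latency_ms` and `error` keys (illustrative names, not from any specific library) and using the nearest-rank percentile definition:

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; returns None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    # Smallest rank such that at least pct% of samples are at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def infra_summary(calls):
    """calls: list of dicts with 'latency_ms' and 'error' (None on success)."""
    if not calls:
        return {}
    latencies = [c["latency_ms"] for c in calls if c["error"] is None]
    errors = sum(1 for c in calls if c["error"] is not None)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(calls),
    }


# 20 successful calls (100 ms .. 2000 ms) plus one timeout.
calls = [{"latency_ms": float(ms), "error": None} for ms in range(100, 2100, 100)]
calls.append({"latency_ms": 30000.0, "error": "timeout"})
summary = infra_summary(calls)
```

To get the numbers per endpoint and per model, group the records first and call `infra_summary` on each group.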

Layer 2: LLM Quality

The most important layer, and the one most often omitted:

  • Faithfulness: does the answer follow from the provided context?
  • Answer relevance: does the answer address the question?
  • Hallucination rate: percentage of answers with invented facts
  • Completeness: does the answer cover all aspects of the question?
  • Toxicity: harmful or inappropriate content

Layer 3: Business Impact

Ties technical quality to real value:

  • Task completion rate: did the user achieve their goal?
  • Satisfaction signal: thumbs up/down, explicit ratings
  • Escalation rate: how often does the bot hand off to a human?
  • Conversion: did the AI interaction lead to the desired action?
  • Session abandonment: did the user leave mid-conversation?
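The Layer 3 rates are simple ratios once each session carries a few outcome flags. A minimal sketch, assuming a hypothetical `Session` record; how you decide that a task was "completed" or a session "abandoned" is product-specific and not covered here.

```python
from dataclasses import dataclass


@dataclass
class Session:
    completed_task: bool  # user reached their goal
    escalated: bool       # conversation was handed off to a human
    abandoned: bool       # user left mid-conversation
    thumbs: int = 0       # +1 thumbs up, -1 thumbs down, 0 no rating


def business_summary(sessions):
    """Compute per-session rates; thumbs_up_share only counts rated sessions."""
    n = len(sessions)
    if n == 0:
        return {}
    rated = [s.thumbs for s in sessions if s.thumbs != 0]
    return {
        "task_completion_rate": sum(s.completed_task for s in sessions) / n,
        "escalation_rate": sum(s.escalated for s in sessions) / n,
        "abandonment_rate": sum(s.abandoned for s in sessions) / n,
        "thumbs_up_share": sum(1 for t in rated if t > 0) / len(rated) if rated else None,
    }


sessions = [
    Session(True, False, False, thumbs=1),
    Session(True, False, False),
    Session(False, True, False, thumbs=-1),
    Session(False, False, True),
]
report = business_summary(sessions)
```

Storing the `trace_id` of each call alongside the session makes it possible to correlate these rates with the quality scores from Layer 2.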

/// THREE LAYERS OF AI MONITORING

From infrastructure to business impact

01 INFRASTRUCTURE AND API: monitor it like any other software

  • Latency p50 / p95 / p99: end-to-end + TTFT
  • Error rate (API + validation): alert > 5%
  • Throughput (req/s, req/day): per endpoint
  • Token cost (input + output): daily trend
  • Rate limit proximity: alert > 80%

02 LLM ANSWER QUALITY: requires evaluation, not code

  • Faithfulness: is the answer grounded in the context?
  • Answer relevance: does it answer the question?
  • Hallucination rate: alert > 10%
  • Completeness: is the answer complete?
  • Toxicity / Safety: alert > 0.5%

03 BUSINESS IMPACT: the ultimate measure of AI success

  • Task completion rate: goal achieved?
  • Thumbs up / down ratio: direct feedback
  • Escalation rate: alert < 2% or > 25%
  • Conversion: purchase, sign-up
  • Session abandonment: alert > 40%

3 monitoring layers · 15+ key metrics · <$2 evaluation per day (10K calls)

Tracing — every call logged

The core of production monitoring is structured logging of every LLM call. Each trace should contain:

  1. trace_id: unique call identifier, passed through the entire system
  2. model: which model, which version
  3. input_tokens / output_tokens: for cost calculation
  4. cached_tokens: how many tokens came from cache
  5. latency_ms: end-to-end response time
  6. prompt_hash: identifies recurring queries
  7. cost_usd: calculated cost per call
  8. user_id: for per-user analysis
  9. feature: which part of the application generated the call

Implementation in Python

llm_tracer.py

```python
# llm_tracer.py
import hashlib
import json
import logging
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Optional

from openai import OpenAI

logger = logging.getLogger(__name__)
client = OpenAI()

# USD per 1M tokens as (input, output). Verify against current provider pricing.
COST_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "o3-mini": (1.10, 4.40),
}


@dataclass
class LLMTrace:
    trace_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    prompt_hash: str
    cost_usd: float
    error: Optional[str] = None
    user_id: Optional[str] = None
    feature: Optional[str] = None


def traced_call(messages, model="gpt-4o-mini", trace_id=None,
                user_id=None, feature=None) -> tuple[str, LLMTrace]:
    trace_id = trace_id or str(uuid.uuid4())
    # Hashes only the first message (usually the system prompt),
    # so calls that share a prompt template get the same hash.
    prompt_hash = hashlib.sha256(str(messages[0]).encode()).hexdigest()[:12]
    t0 = time.perf_counter()
    try:
        resp = client.chat.completions.create(model=model, messages=messages)
        latency_ms = (time.perf_counter() - t0) * 1000
        usage = resp.usage
        # Cached tokens are reported under usage.prompt_tokens_details,
        # not directly on usage.
        details = getattr(usage, "prompt_tokens_details", None)
        cached = getattr(details, "cached_tokens", 0) or 0
        # Unknown models are recorded with zero cost instead of raising KeyError.
        in_p, out_p = COST_PER_1M.get(model, (0.0, 0.0))
        # Cached input tokens are billed here at the full input rate,
        # so the figure is an upper bound.
        cost = (usage.prompt_tokens * in_p + usage.completion_tokens * out_p) / 1_000_000
        trace = LLMTrace(trace_id, model, usage.prompt_tokens,
                         usage.completion_tokens, cached,
                         latency_ms, prompt_hash, cost,
                         user_id=user_id, feature=feature)
        # json.dumps makes each log record a single valid JSON object.
        logger.info("llm_trace %s", json.dumps(asdict(trace)))
        return resp.choices[0].message.content, trace
    except Exception as e:
        trace = LLMTrace(trace_id, model, 0, 0, 0,
                         (time.perf_counter() - t0) * 1000, prompt_hash, 0.0,
                         error=str(e), user_id=user_id, feature=feature)
        logger.error("llm_trace_error %s", json.dumps(asdict(trace)))
        raise
```

Output: One JSON line per call — ready for CloudWatch, Datadog, Grafana Loki or BigQuery.
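Once the traces are JSON lines, per-feature cost is a small aggregation. A minimal sketch, assuming each log line has the shape `llm_trace {json}` with `feature` and `cost_usd` fields as in the trace record described above; the function name `cost_per_feature` and the sample lines are illustrative.

```python
import json
from collections import defaultdict


def cost_per_feature(log_lines):
    """Sum cost_usd per feature from lines shaped like '<prefix> {json}'."""
    totals = defaultdict(float)
    for line in log_lines:
        start = line.find("{")
        if start == -1:
            continue  # not a trace line
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        feature = record.get("feature") or "unknown"
        totals[feature] += record.get("cost_usd", 0.0)
    return dict(totals)


lines = [
    'llm_trace {"feature": "search", "cost_usd": 0.002}',
    'llm_trace {"feature": "search", "cost_usd": 0.003}',
    'llm_trace {"feature": null, "cost_usd": 0.001}',
    "unrelated log line",
]
totals = cost_per_feature(lines)
```

In a hosted log platform the same grouping would normally be a query rather than Python; the sketch only shows which fields the query needs.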

Quality evaluation — LLM as judge

Automated quality evaluation works on a sample: you evaluate 5–10% of production calls using a second LLM as judge.
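One way to pick the sample is to hash the trace ID, so the decision is reproducible and does not depend on which worker handles the call. A minimal sketch; `should_evaluate` and the default rate of 0.05 are illustrative, and a plain `random.random() < rate` check would also work if reproducibility is not needed.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Count how many of 10,000 synthetic trace IDs fall into the sample.
sampled = sum(should_evaluate(f"trace-{i}") for i in range(10_000))
```

Because the same trace ID always gets the same answer, a failed evaluation job can be re-run over the logs without changing which calls are in the sample.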

| Metric | Scale | What it measures | Alert threshold |
|---|---|---|---|
| Faithfulness | 1–5 | Does the answer follow from the context? | < 3.5 |
| Relevance | 1–5 | Does it address the question? | < 3.0 |
| Completeness | 1–5 | Does it cover all aspects? | < 3.0 |
| Hallucination | boolean | Invented facts detected? | > 5% of calls |
| Toxicity | boolean | Harmful content? | > 0% of calls |

llm_judge.py

```python
# llm_judge.py
from typing import Optional

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# instructor wraps the client so responses are parsed into a Pydantic model.
ic = instructor.from_openai(OpenAI())


class EvalResult(BaseModel):
    faithfulness: int = Field(ge=1, le=5, description="1=contradicts context, 5=fully grounded")
    relevance: int = Field(ge=1, le=5, description="1=off-topic, 5=fully answers question")
    completeness: int = Field(ge=1, le=5, description="1=incomplete, 5=covers all aspects")
    hallucination_detected: bool
    issues: Optional[list[str]] = None


def evaluate_rag_response(question: str, context: str, answer: str) -> EvalResult:
    # Separate the sections with blank lines so the judge can tell
    # where the question, the context and the answer begin and end.
    user_content = (
        "QUESTION:\n" + question
        + "\n\nCONTEXT:\n" + context
        + "\n\nAI ANSWER TO EVALUATE:\n" + answer
    )
    return ic.chat.completions.create(
        model="gpt-4o-mini",
        response_model=EvalResult,
        messages=[
            {"role": "system", "content": "You are an expert evaluator of AI assistant responses. Be strict and objective."},
            {"role": "user", "content": user_content},
        ],
    )
```

/// LLM QUALITY EVALUATION PIPELINE

From production logs to a quality alert

  1. LLM calls: every call is logged
  2. Sampling 5%: random sample
  3. LLM judge: GPT-4o mini scores the answers
  4. Aggregation: averages, trends
  5. Alert / dashboard: fires if a metric falls below its threshold

Offline, not real-time. Evaluation runs on a sample of the logs (a nightly or hourly job). For 10,000 calls/day: 500 evaluations × ~$0.002 = $1/day for full quality monitoring.

0.85+ correlation with human ratings · ~$0.002 cost per evaluation · 5% sample = full coverage
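The aggregation step can be sketched as a function that averages the judge scores and compares them with the thresholds from the table above. This is an illustrative sketch: the dict keys mirror the `EvalResult` fields, while `quality_report`, the constants and the sample scores are assumptions.

```python
from statistics import mean

# Minimum acceptable averages, taken from the threshold table above.
THRESHOLDS = {"faithfulness": 3.5, "relevance": 3.0, "completeness": 3.0}
MAX_HALLUCINATION_RATE = 0.05


def quality_report(evals):
    """evals: list of dicts with 1-5 scores and a 'hallucination_detected' flag."""
    if not evals:
        return {"n": 0, "breaches": []}
    report = {"n": len(evals)}
    breaches = []
    for metric, minimum in THRESHOLDS.items():
        report[metric] = mean(e[metric] for e in evals)
        if report[metric] < minimum:
            breaches.append(metric)
    rate = sum(e["hallucination_detected"] for e in evals) / len(evals)
    report["hallucination_rate"] = rate
    if rate > MAX_HALLUCINATION_RATE:
        breaches.append("hallucination_rate")
    report["breaches"] = breaches
    return report


evals = [
    {"faithfulness": 4, "relevance": 5, "completeness": 4, "hallucination_detected": False},
    {"faithfulness": 2, "relevance": 4, "completeness": 3, "hallucination_detected": True},
    {"faithfulness": 3, "relevance": 4, "completeness": 4, "hallucination_detected": False},
]
report = quality_report(evals)
```

With only a handful of evaluations per window the averages are noisy, so it is worth requiring a minimum sample size before acting on a breach.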

Alerts — what to watch and what threshold to set

Hard alerts (immediate action required)

  • Error rate > 5% in 5-minute window — likely model or API failure
  • p95 latency > 30s — likely token generation loop or overloaded context
  • Faithfulness < 3.5 average in 60-minute window — RAG quality degradation
  • Cost growth > 300% vs. previous day — prompt injection or runaway loop
  • Rate limit hit rate > 20% — need to implement throttling or increase limits
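The cost rule is easy to get wrong by a factor, so it helps to pin down the arithmetic. The sketch below reads "growth > 300%" as "today's spend is more than four times yesterday's"; that reading, and the function names, are assumptions.

```python
def cost_growth_pct(today_usd: float, yesterday_usd: float) -> float:
    """Percentage growth of today's spend relative to yesterday's."""
    if yesterday_usd <= 0:
        # No baseline: any spend at all counts as unbounded growth.
        return float("inf") if today_usd > 0 else 0.0
    return (today_usd - yesterday_usd) / yesterday_usd * 100


def cost_alert(today_usd: float, yesterday_usd: float, max_growth_pct: float = 300.0) -> bool:
    """True when growth exceeds the threshold (strictly greater than)."""
    return cost_growth_pct(today_usd, yesterday_usd) > max_growth_pct


fires = cost_alert(45.0, 10.0)   # 350% growth
quiet = cost_alert(30.0, 10.0)   # 200% growth
```

Comparing against the same weekday of the previous week instead of the previous day is a common variation when traffic has a strong weekly pattern.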

Soft alerts (monitor, don't wake at night)

  • Upward trend in p50 latency: a sign of gradually increasing load on the model
  • Hallucination rate > 3% — requires prompt review, not immediate intervention
  • Completion rate drop > 10% vs. weekly baseline — possible UX or quality issue
  • Cache hit rate drop — context structure may have changed
  • Specific user error spike — could indicate abuse or edge case
alerting.py

```python
# alerting.py
import threading
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass
class Alert:
    level: str
    metric: str
    value: float
    threshold: float
    message: str


class MetricBuffer:
    """Sliding time window of metric samples, safe to use from several threads."""

    def __init__(self, window_minutes=10):
        self.window = timedelta(minutes=window_minutes)
        self._data: deque = deque()
        self._lock = threading.Lock()

    def add(self, value: float):
        with self._lock:
            self._data.append((datetime.now(timezone.utc), value))

    def _values(self):
        # Drop samples older than the window, then return a snapshot,
        # so stale data never contributes to an alert.
        with self._lock:
            cutoff = datetime.now(timezone.utc) - self.window
            while self._data and self._data[0][0] < cutoff:
                self._data.popleft()
            return [v for _, v in self._data]

    def mean(self):
        vals = self._values()
        return sum(vals) / len(vals) if vals else 0

    def p95(self):
        vals = sorted(self._values())
        return vals[int(len(vals) * 0.95)] if vals else 0

    def count(self):
        return len(self._values())


latency = MetricBuffer(5)        # latency in ms, one sample per call
error_rate = MetricBuffer(5)     # 1.0 for a failed call, 0.0 for a successful one
faithfulness = MetricBuffer(60)  # judge scores on the 1-5 scale


def check_alerts(notify: Callable[[Alert], None]):
    if error_rate.count() > 10 and error_rate.mean() > 0.05:
        notify(Alert("critical", "error_rate", error_rate.mean(), 0.05, "Error rate exceeded 5%"))
    if latency.p95() > 30_000:
        notify(Alert("critical", "latency_p95", latency.p95(), 30_000, "p95 latency > 30s"))
    # An empty buffer has mean 0, which would fire a false alert, so require data first.
    if faithfulness.count() > 0 and faithfulness.mean() < 3.5:
        notify(Alert("warning", "faithfulness", faithfulness.mean(), 3.5, "RAG quality degraded"))
```

Run check_alerts every 60 seconds in a background thread — with Slack, PagerDuty or email as the notify target.
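The background loop can be sketched with a daemon thread and a `threading.Event` for clean shutdown. `start_alert_loop` is a hypothetical helper; in production the `check` argument would be something like `lambda: check_alerts(notify)`, with `notify` posting to Slack, PagerDuty or email.

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


def start_alert_loop(check, interval_s: float = 60.0):
    """Call check() every interval_s seconds until the returned event is set."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop sleeps between checks and exits promptly on shutdown.
        while not stop.wait(interval_s):
            try:
                check()
            except Exception:
                # A failing check must not kill the monitoring thread.
                logger.exception("alert check failed")

    thread = threading.Thread(target=loop, name="alert-loop", daemon=True)
    thread.start()
    return stop, thread


# Demo with a short interval and a stand-in check function.
ticks = []
stop, thread = start_alert_loop(lambda: ticks.append(time.time()), interval_s=0.01)
time.sleep(0.2)
stop.set()
thread.join(timeout=1.0)
```

A simple loop like this re-sends the same alert on every tick while a condition persists, so a cooldown per metric is usually added before wiring it to a pager.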

A/B testing prompts

Comparing prompt versions is standard practice for production AI systems. A framework in 5 steps:

  1. Define the hypothesis: "adding chain-of-thought will improve faithfulness by 0.3 points"
  2. Split traffic: e.g. 90% control, 10% variant (small sample first)
  3. Collect evaluations: LLM-judge on both groups, minimum 500 calls per group
  4. Statistical test: Mann-Whitney U (non-normal distributions are typical in AI quality metrics)
  5. Make a decision: if p < 0.05 and the improvement exceeds the minimal threshold, roll out; otherwise revert

Don't test more than two variants at once — it complicates analysis and lengthens the time to a statistically significant result.
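The statistical test in step 4 can be sketched in pure Python. The version below uses the normal approximation with a tie correction and no continuity correction, which is reasonable at the sample sizes suggested above; it is an illustration, and in practice `scipy.stats.mannwhitneyu` is the usual choice. The sample scores are made up.

```python
import math
from collections import Counter
from statistics import NormalDist


def mann_whitney_u(control, variant):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of `variant`, p-value).
    """
    n1, n2 = len(control), len(variant)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one score")
    combined = sorted(list(control) + list(variant))

    # Assign each distinct value the average of the 1-based ranks it occupies.
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_sum_variant = sum(ranks[v] for v in variant)
    u_variant = rank_sum_variant - n2 * (n2 + 1) / 2

    n = n1 + n2
    tie_term = sum(t**3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u_variant, 1.0  # all scores identical: no evidence of a difference
    z = (u_variant - n1 * n2 / 2) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_variant, p_value


control_scores = [3, 4, 4, 5, 3, 4, 4, 3, 4, 5]
variant_scores = [4, 5, 4, 5, 5, 4, 5, 4, 5, 5]
u_stat, p_value = mann_whitney_u(control_scores, variant_scores)
```

Judge scores on a 1-5 scale contain many ties, which is why the tie correction matters here.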

Production monitoring checklist

  1. Structured logging of every LLM call with trace_id, model, tokens, latency, cost
  2. Sampling 5–10% of calls for LLM-judge quality evaluation
  3. Dashboards: latency p50/p95/p99, error rate, cost per feature
  4. Hard alerts: error rate > 5%, p95 > 30s, faithfulness < 3.5
  5. Soft alerts: latency trends, cache hit rate, completion rate
  6. Post-deployment baselines after each model update
  7. Automatic A/B testing framework for prompt changes
  8. Cost alerts: daily budget, anomaly detection
  9. Business metrics correlated with technical quality
  10. Weekly review of quality metric trends

| Tool | Type | Best for | Cost |
|---|---|---|---|
| Helicone | Proxy | Quick start, zero code changes | Free / $20+ |
| LangSmith | SDK | LangChain teams | Free / $39+ |
| Datadog LLM | APM | Enterprise, existing Datadog | $$$ |
| Langfuse | Open source | Full control, self-hosted | Free |
| Custom (as above) | Custom | Complete flexibility | Infrastructure cost |

---

I build AI application monitoring systems for production — from distributed tracing and automated quality evaluation to dashboards and alerting. If your LLM app is already in production but you don't know what's happening inside it, get in touch — I start with an audit of your logs and a baseline metrics setup.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

Senior Full-Stack Engineer & AI Architect

8+ years building AI systems, automations, and scalable web applications that reduce costs and improve operational efficiency.
