Prompt caching activates automatically — do I need to configure anything?

On OpenAI, caching kicks in automatically when a prompt prefix of 1,024+ tokens repeats within about 5 minutes, so there is nothing to switch on. Your job is prompt structure: put constant elements first. Because cache entries expire after roughly 5 minutes of inactivity, low-traffic endpoints (fewer than one request per 5 minutes) will often miss. Log `resp.usage.prompt_tokens_details.cached_tokens` from each response to track hit rate over time.

How do I measure response quality when routing so I don't degrade UX?

Shadow mode: for one week, send every query to both models (mini and full) but show the user only the full model's response. Compare the pairs server-side on length, ROUGE-L and user feedback. If mini reaches more than 95% of the full model's quality on those measures, enable routing in production. Alternative: LLM-as-judge, where GPT-4o rates both responses (mini vs full) on a 1–5 scale without knowing which model generated which.
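The blind part of LLM-as-judge is easy to get wrong, so here is a minimal sketch of it. It shuffles the two answers into anonymous slots A and B and maps the judge's scores back afterwards. The names `blind_pair` and `unblind`, the prompt wording and the example scores are assumptions for illustration; the call to the judge model itself is left out.

```python
import random


def blind_pair(mini_answer: str, full_answer: str, rng: random.Random):
    """Return (judge_prompt, mapping) with the two answers in random order."""
    pair = [("mini", mini_answer), ("full", full_answer)]
    rng.shuffle(pair)
    # The mapping stays server-side; the judge only ever sees "A" and "B".
    mapping = {"A": pair[0][0], "B": pair[1][0]}
    prompt = (
        "Rate each response from 1 to 5.\n"
        f"Response A:\n{pair[0][1]}\n\nResponse B:\n{pair[1][1]}"
    )
    return prompt, mapping


def unblind(scores: dict, mapping: dict) -> dict:
    """Translate the judge's {"A": score, "B": score} back to model names."""
    return {mapping[label]: score for label, score in scores.items()}


prompt, mapping = blind_pair("Short answer.", "Longer, detailed answer.", random.Random(0))
result = unblind({"A": 4, "B": 5}, mapping)  # scores here are made-up judge output
```

Randomising the order per query also guards against a judge that systematically favours whichever response comes first.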

What is Batch API and when does it make sense?

Batch API processes requests asynchronously with a 50% price discount — you submit a JSONL file with multiple requests, OpenAI processes within 24h. Ideal for: bulk data extraction, product description generation, document classification, nightly reports. Not suitable for chatbots and interactive agents where the user is waiting for a response.
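As a rough illustration, the JSONL input file can be assembled like this. Each line is one request object with `custom_id`, `method`, `url` and `body` fields. The helper name `build_batch_lines`, the sample prompts and the `max_tokens` value are assumptions; uploading the file and creating the batch are not shown.

```python
import json


def build_batch_lines(prompts: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Return one JSON string per request, in the Batch API input format."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"req-{i}",  # used later to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,  # assumed cap for this example
            },
        }
        lines.append(json.dumps(request))
    return lines


lines = build_batch_lines(["Classify: 'late delivery'", "Classify: 'great service'"])
batch_jsonl = "\n".join(lines)  # write this string to a .jsonl file before uploading
```

Results come back in a separate output file and are not guaranteed to keep the input order, which is why every request carries its own `custom_id`.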

What tools should I use to monitor API costs?

Four options from simplest: (1) OpenAI Usage Dashboard — basic view with no per-endpoint breakdown; (2) Helicone — proxy between your code and OpenAI, zero code changes, full per-call metrics; (3) LangSmith — call tracing with cost and latency, best if you use LangChain; (4) custom middleware — intercept `resp.usage` and log to your own database with dashboards and alerts. For a quick start: Helicone. For full control: custom solution with per-endpoint alerts.
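For option (4), a minimal sketch of the bookkeeping side might look like the following. It aggregates cost per endpoint and flags endpoints whose average cost per call exceeds a threshold. The class name `CostTracker`, the endpoint paths and the threshold are hypothetical; the prices are the GPT-4o and GPT-4o mini rates quoted in this article, and in a real application the token counts would come from `resp.usage`.

```python
from collections import defaultdict

# USD per 1M tokens as (input, output), taken from the article's price table.
PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}


class CostTracker:
    def __init__(self, alert_usd_per_call: float):
        self.alert_usd_per_call = alert_usd_per_call
        self.totals = defaultdict(lambda: {"calls": 0, "usd": 0.0})

    def record(self, endpoint: str, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Add one call to the endpoint's running totals and return its cost."""
        in_price, out_price = PRICES[model]
        usd = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
        entry = self.totals[endpoint]
        entry["calls"] += 1
        entry["usd"] += usd
        return usd

    def over_budget(self) -> list[str]:
        """Endpoints whose average cost per call is above the alert threshold."""
        return sorted(
            ep for ep, t in self.totals.items()
            if t["usd"] / t["calls"] > self.alert_usd_per_call
        )


tracker = CostTracker(alert_usd_per_call=0.005)
tracker.record("/support/chat", "gpt-4o-mini", prompt_tokens=4000, completion_tokens=300)
tracker.record("/reports/analyse", "gpt-4o", prompt_tokens=6000, completion_tokens=800)
alerts = tracker.over_budget()
```

In production the totals would be written to a database rather than held in memory, but the per-endpoint grouping is the part that the basic usage dashboard does not give you.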


2026-06-05 · AI & Automation · 13 min

OpenAI API cost optimisation: caching, routing and context compression


Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

The AI project is live, tests passed, production deployment done — and after one month the OpenAI API bill is four times the budgeted amount. This is not a rare case. GPT-4o costs 16x more than GPT-4o mini for the same task, and an unoptimised agent context grows with every step. The four techniques below reduce costs by 60–80% without any degradation in response quality.

How to cut your OpenAI API bill by 60–80% without reducing response quality — prompt caching, model routing and conversation history compression.

Anatomy of Cost — What Generates the Bills

Every API call has two components: input tokens (prompt + history + system message + tool definitions) and output tokens (model response). Output tokens cost 2–4x more than input — but there are usually far fewer of them. So cost optimisation starts with input tokens, which account for 70–85% of a typical call's cost.

OpenAI Model Prices (2026)

The four most commonly used production models differ in price by an order of magnitude. Choosing the right one for each task is the biggest cost lever:

Model	Input (per 1M)	Output (per 1M)	Cached input	When to use
GPT-4o mini	$0.15	$0.60	$0.075	Simple tasks — 60–70% of traffic
GPT-4o	$2.50	$10.00	$1.25	Analysis, code, complex instructions
o3-mini	$1.10	$4.40	—	Maths, algorithms, reasoning
o1	$15.00	$60.00	$7.50	Hardest multi-step tasks
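To make the input-versus-output split concrete, here is a small worked calculation using the prices from the table above. The function name `call_cost` and the token counts (4,000 input, 300 output) are assumptions chosen to resemble an agentic call, not measurements.

```python
# USD per 1M tokens as (input, output, cached input), from the table above.
PRICES_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60, 0.075),
    "gpt-4o": (2.50, 10.00, 1.25),
}


def call_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0):
    """Return (input_usd, output_usd) for a single call."""
    in_price, out_price, cached_price = PRICES_PER_1M[model]
    uncached = input_tokens - cached_tokens
    input_usd = (uncached * in_price + cached_tokens * cached_price) / 1_000_000
    output_usd = output_tokens * out_price / 1_000_000
    return input_usd, output_usd


input_usd, output_usd = call_cost("gpt-4o-mini", input_tokens=4000, output_tokens=300)
input_share = input_usd / (input_usd + output_usd)  # about 0.77
```

With these assumed counts, input accounts for roughly 77% of the call's cost even though each output token is four times more expensive.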

What Goes Into Input Tokens

In a typical agentic application, most input cost comes from elements sent on every call — not from the user's message:

System message — 200–500 tokens, constant for every call
Tool definitions — 100–300 tokens per tool, always sent
Conversation history — grows linearly, 500–2000 tok/turn after a few agent steps
RAG context — 500–4000 tokens depending on retrieval
User message — 20–200 tokens, usually the smallest part

/// API CALL COST OPTIMIZATION

Cost per 1000 calls — GPT-4o mini, before and after optimization

Component	Before optimization	After optimization
System message (2000 tok)	$0.300	$0.015 (cached)
Tool definitions (400 tok)	$0.060	$0.006 (cached)
Conversation history (1500 tok, compressed to 600 tok)	$0.225	$0.090
User message (100 tok)	$0.015	$0.015
Total / 1000 calls	$0.600	$0.126

Key figures: -79% cost reduction · 75% cheaper cached tokens · 10x price difference, mini vs full

Prompt Caching — The Biggest Quick Win

Prompt caching cuts the cost of repeated prompt prefixes by 50–75%, depending on the model; at the prices in the table above, cached input on the GPT-4o family is 50% cheaper. It is available in OpenAI (from GPT-4o) and Anthropic (from Claude 3.5). On OpenAI the cache is created automatically when the same prefix has 1024+ tokens and appears within 5 minutes of the previous call. No configuration is needed: just keep constant elements at the top of the prompt.

How to Structure Your Prompt for Caching

The order of elements in the prompt determines cache hit rate. Correct order — from most constant to most frequently changing:

1. System message: constant across the application, always cached
2. Tool definitions: constant, cached together with the system message
3. Context documents: constant per user session, cached
4. Conversation history: changes with every turn, not cached
5. User message: always new, not cached

Common mistake: putting the date, session ID or user name in the system message. This breaks the cache for every user and session — move those to the user message or a separate context block after the conversation history.
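A minimal sketch of that fix, assuming a hypothetical `build_messages` helper: the system message stays identical on every call, and the per-request values (user name, date) travel with the user message at the end. The prompt text and the context format are placeholders.

```python
from datetime import date

# Stand-in for the real, long system prompt; it must not contain per-user data.
SYSTEM_PROMPT = "You are a customer support assistant for Acme Corp."


def build_messages(history: list[dict], user_message: str, user_name: str, today: date) -> list[dict]:
    # Constant prefix first: identical on every call, so it stays cacheable.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)
    # Per-request data goes last, so it never alters the shared prefix.
    context = f"[Context] user: {user_name}, date: {today.isoformat()}"
    messages.append({"role": "user", "content": f"{context}\n\n{user_message}"})
    return messages


a = build_messages([], "Where is my order?", "Anna", date(2026, 6, 5))
b = build_messages([], "Cancel my plan", "Marek", date(2026, 6, 6))
```

Both calls start with exactly the same system message, which is what allows different users and sessions to share one cached prefix.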

prompt_caching.py

```python
from openai import OpenAI

client = OpenAI()

# Constant prefix: identical on every call, so it is eligible for caching.
SYSTEM_PROMPT = """You are a customer support assistant for Acme Corp.
Rules: only answer questions about Acme products, escalate to a human when the customer is unhappy or asks for a refund > $500.
[...2000 tokens of product descriptions, procedures and pricing: constant text, always cached...]"""


def chat(history: list, user_message: str) -> tuple[str, dict]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": user_message},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    # prompt_tokens_details may be absent, so fall back to 0 cached tokens.
    details = getattr(resp.usage, "prompt_tokens_details", None)
    cached_tokens = details.cached_tokens if details else 0
    return resp.choices[0].message.content, {
        "total_tokens": resp.usage.total_tokens,
        "cached_tokens": cached_tokens,
        # GPT-4o mini: $0.15 regular minus $0.075 cached = $0.075 saved per 1M tokens.
        "saved_usd": round(cached_tokens * 0.075 / 1_000_000, 6),
    }
```

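Cache hit rate in practice: an application handling 10,000 calls/day with a 2,000-token system message sends 20M repeated prefix tokens a day. On GPT-4o, where a cached token saves $1.25 per 1M, that is up to $25/day, or about $22/day at a hit rate near 90%; on GPT-4o mini the same traffic saves at most $1.50/day. The cached-token count is returned in `resp.usage.prompt_tokens_details.cached_tokens`: log it and monitor hit rate over time.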
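One simple way to turn those logged values into a hit rate is to divide cached prompt tokens by total prompt tokens. The sketch below assumes you store `prompt_tokens` and `cached_tokens` per call; the function name `cache_hit_rate` and the sample log entries are illustrative, not real measurements.

```python
def cache_hit_rate(usage_log: list[dict]) -> float:
    """Share of prompt tokens that were served from cache across logged calls."""
    prompt = sum(u["prompt_tokens"] for u in usage_log)
    cached = sum(u["cached_tokens"] for u in usage_log)
    return cached / prompt if prompt else 0.0


usage_log = [
    {"prompt_tokens": 2600, "cached_tokens": 0},     # first call: cache is cold
    {"prompt_tokens": 2700, "cached_tokens": 2048},
    {"prompt_tokens": 2900, "cached_tokens": 2048},
]
rate = cache_hit_rate(usage_log)  # 4096 / 8200, roughly 0.50
```

A rate that drops suddenly usually means something dynamic has crept into the prefix, or traffic has fallen low enough for cache entries to expire between calls.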

Query Routing — The Right Model for the Right Task

60–70% of queries in typical business applications don't need full GPT-4o. A router classifies the query and directs it to the cheapest model that can handle the task. The classifier cost — one GPT-4o mini call with ~20 output tokens — pays for itself immediately when most traffic goes to a model that's 16x cheaper.

When GPT-4o mini (60–70% of traffic)

Text classification — sentiment, category, user intent
Simple FAQ answers from supplied context
Data extraction from structured texts (structured output)
Summarising short documents — up to 1,000 words
Translation without specialist terminology

When GPT-4o (25–35% of traffic)

Analysis of complex documents — contracts, financial reports, specifications
Code generation and code review
Tasks with multiple tools and dependencies between calls
Responses where quality directly affects revenue or customer decisions
Content generation requiring coherence and creativity

When o3-mini or o1 (5–10% of traffic)

Mathematical proofs and algorithmic tasks
Complex SQL generation, scripts or system architecture
Risk analysis requiring multi-step reasoning
Tasks where the cost of an error far exceeds the cost of a token

model_router.py

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()
# instructor wraps a client so that responses are parsed into Pydantic models.
ic = instructor.from_openai(OpenAI())


class RouteDecision(BaseModel):
    model: Literal["gpt-4o-mini", "gpt-4o", "o3-mini"]
    reason: str


def classify_query(query: str) -> RouteDecision:
    # The classifier itself runs on the cheapest model.
    return ic.chat.completions.create(
        model="gpt-4o-mini",
        response_model=RouteDecision,
        messages=[
            {"role": "system", "content": "Decide which OpenAI model to use. gpt-4o-mini: simple questions, classification, extraction, FAQ. gpt-4o: analysis, code, complex reasoning. o3-mini: maths, algorithms, multi-step reasoning."},
            {"role": "user", "content": f"Query: {query}"},
        ],
    )


def smart_chat(query: str, history: list) -> str:
    route = classify_query(query)
    # Send the actual request to whichever model the classifier picked.
    resp = client.chat.completions.create(
        model=route.model,
        messages=[*history, {"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```

Important: the router is an extra API call — it adds 50–100ms of latency. If latency is critical, run routing in parallel with the first LLM call, and cache routing decisions per query class (e.g. all "order status questions" → always mini).
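Caching routing decisions per query class could be sketched as below. Everything here is an assumption for illustration: the keyword rules in `CLASS_RULES`, the helper names, and the stand-in classifier that replaces the real model call so the example runs offline.

```python
import re

ROUTE_CACHE: dict[str, str] = {}

# Hypothetical keyword rules that map a query to a coarse class.
CLASS_RULES = [
    ("order_status", re.compile(r"\b(order|delivery|tracking)\b", re.I)),
    ("billing", re.compile(r"\b(invoice|refund|payment)\b", re.I)),
]


def query_class(query: str):
    for name, pattern in CLASS_RULES:
        if pattern.search(query):
            return name
    return None


def route(query: str, classify) -> str:
    """Return a model name, calling `classify` only on a cache miss."""
    cls = query_class(query)
    if cls is None:
        return classify(query)  # unknown class: always ask the classifier
    if cls not in ROUTE_CACHE:
        ROUTE_CACHE[cls] = classify(query)
    return ROUTE_CACHE[cls]


calls = []


def fake_classifier(query: str) -> str:
    # Stand-in for the LLM router; records how often it is invoked.
    calls.append(query)
    return "gpt-4o-mini"


first = route("Where is my order #123?", fake_classifier)
second = route("Has my delivery shipped?", fake_classifier)  # served from cache
```

The second query never reaches the classifier, so it pays neither the extra tokens nor the extra latency. The trade-off is that one misrouted class stays misrouted until the cache entry is cleared.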

/// MODEL ROUTING STRATEGY

Query routing — traffic distribution in a typical B2B app

Input query → classifier (GPT-4o mini, ~20 tok) → route to model

Share of traffic	Task type	Model	Price / 1M tok	Examples
65%	Simple tasks	GPT-4o mini	$0.15	FAQ, classification, extraction, summarizing short texts
30%	Complex tasks	GPT-4o	$2.50	Document analysis, code generation, multi-step instructions
(not given)	Reasoning	o3-mini	$1.10	Math, algorithms, complex code generation

Key figures: $0.73 average cost per 1M tok with routing · -71% vs $2.50 flat (GPT-4o without routing) · <5ms classifier overhead
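A blended price of this kind is a traffic-weighted average, and it is worth recomputing for your own mix. The sketch below uses the input prices from the price table earlier in the article and the 65% and 30% shares from the figure. The 5% share for o3-mini is an assumption (the figure does not state it), as is the simplification that every class consumes the same number of tokens per call.

```python
# Input price in USD per 1M tokens, from the article's price table.
INPUT_PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "o3-mini": 1.10}


def blended_price(traffic_share: dict[str, float]) -> float:
    """Traffic-weighted average input price per 1M tokens."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(INPUT_PRICE_PER_1M[m] * share for m, share in traffic_share.items())


mix = {"gpt-4o-mini": 0.65, "gpt-4o": 0.30, "o3-mini": 0.05}  # 5% is assumed
price = blended_price(mix)
saving = 1 - price / INPUT_PRICE_PER_1M["gpt-4o"]
```

Under these particular assumptions the blended price comes out near $0.90 per 1M input tokens, about 64% below flat GPT-4o. The result moves noticeably with the traffic mix and with how many tokens each class uses, so treat any single headline figure as indicative.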

Context Compression — Stop the Growing Bill

In agents and chatbots, conversation history grows with every step. After 20 turns it can reach 8,000+ tokens sent on every call. Three strategies, from simplest to most precise:

Strategy 1: Sliding window

Keep only the last N messages. Pros: zero extra API calls, two lines of code. Cons: loss of long-term memory — the model doesn't remember early conversation context. Good for chatbots where context older than 10 turns rarely matters.
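A minimal sketch of the sliding window, assuming the system message is stored separately from `history` and prepended on each call. The function name `sliding_window` and the sample history are illustrative.

```python
def sliding_window(history: list[dict], keep_last: int = 10) -> list[dict]:
    """Return only the most recent `keep_last` messages."""
    if keep_last <= 0:
        return []  # history[-0:] would return everything, so handle 0 explicitly
    return history[-keep_last:]


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(25)
]
trimmed = sliding_window(history, keep_last=10)  # keeps "turn 15" .. "turn 24"
```

Choosing an even `keep_last` keeps user and assistant messages paired when the history strictly alternates.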

Strategy 2: Summarisation

Every N turns, compress old context into a 300-token summary using GPT-4o mini. Compression cost: ~$0.002 per session. Saving: 60–70% of history tokens while preserving key facts from the whole conversation. Good for long sessions where users return to earlier topics.

Strategy 3: Semantic pruning

Score the semantic distance of each historical message from the current query (cosine similarity of embeddings). Remove messages above the distance threshold. Most precise — keeps the most relevant fragments without lossy compression. Worth implementing for sessions of 50+ turns with a wide variety of topics.
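The pruning step can be sketched as follows, with cosine distance taken as 1 minus cosine similarity. The embeddings are passed in as plain lists so the example runs without an API call; the toy three-dimensional vectors, the `max_distance` default of 0.5 and the function names are assumptions, and a real threshold would need tuning on your own data.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_history(history, embeddings, query_embedding, max_distance: float = 0.5):
    """Drop messages whose cosine distance to the query exceeds max_distance."""
    kept = []
    for message, emb in zip(history, embeddings):
        distance = 1.0 - cosine_similarity(emb, query_embedding)
        if distance <= max_distance:
            kept.append(message)
    return kept


history = [
    {"role": "user", "content": "My invoice is wrong"},
    {"role": "user", "content": "What's the weather like?"},
]
embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]  # toy vectors, not real embeddings
query_embedding = [1.0, 0.0, 0.0]
kept = prune_history(history, embeddings, query_embedding)
```

Only the invoice message survives here, because the second vector is orthogonal to the query. In practice the message embeddings would be computed once and stored, so each turn costs a single new embedding for the query.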

context_compression.py

```python
from openai import OpenAI

client = OpenAI()


def compress_history(history: list[dict], keep_last: int = 6, max_tokens: int = 300) -> list[dict]:
    # Short conversations are returned unchanged.
    if len(history) <= keep_last:
        return history
    old_msgs = history[:-keep_last]
    recent = history[-keep_last:]
    # One message per line, so individual turns stay distinguishable.
    old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old_msgs)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Compress this conversation to max {max_tokens} tokens. Keep key facts, decisions and context. Remove greetings and small talk."},
            {"role": "user", "content": old_text},
        ],
    )
    summary = resp.choices[0].message.content
    # The summary replaces the old messages; recent turns are kept verbatim.
    return [
        {"role": "system", "content": f"[Earlier conversation summary]: {summary}"},
        *recent,
    ]
```

In practice: compression every 10 turns reduces context size by 65–70% while maintaining 95%+ response accuracy. Measure ROUGE-L between full-history and compressed responses; if it drops below 0.9, increase `max_tokens` or increase `keep_last` so that more of the original conversation survives.
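For a quick check without extra dependencies, ROUGE-L can be approximated from the longest common subsequence (LCS) of the two token lists. This is a simplified sketch: it tokenises on whitespace and uses a balanced F1 score, so its numbers may differ slightly from a full ROUGE implementation. The function names and example sentences are illustrative.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> float:
    """Balanced F1 over LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


full = "your order ships on monday and arrives by friday"
compressed = "your order ships monday and arrives friday"
score = rouge_l(full, compressed)  # 7 shared tokens out of 9 and 7
```

Here the score is 0.875, just under the 0.9 threshold, even though the two sentences mean the same thing. That is a reminder to pair ROUGE-L with a judge model or human spot checks before drawing conclusions.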

Checklist — 10 Quick Wins

Implement in this order — each point is an independent optimisation:

1. Check the usage dashboard: find the 20% of calls generating 80% of costs
2. Move constant elements to the top of the prompt: caching activates automatically
3. Limit max_tokens: set a sensible cap instead of leaving the default
4. Replace GPT-4o with mini wherever possible: run an A/B quality test first
5. Shorten the system message: remove repetition and instructions that don't change results
6. Remove unused tool definitions from calls where tools aren't needed
7. Implement sliding window for conversation history (keep_last=10)
8. Add model routing for tasks that can be classified
9. Batch API for non-interactive tasks: 50% cheaper, asynchronous processing
10. Per-endpoint monitoring: alert when cost-per-call exceeds a set threshold

Optimisation	Difficulty	Saving	Implementation time
Prompt caching	Low	40–75% of input cost	2h
Limit max_tokens	Low	10–30%	30 min
Routing — mini for simple tasks	Medium	80–90% on those calls	1–2 days (A/B test)
Sliding window history	Low	20–50% on long sessions	2h
Full model routing	Medium	50–70% of total cost	2–3 days
Batch API	Low	50% on offline tasks	3h

---

I run AI cost audits for companies — I analyse API call logs, identify optimisation opportunities and implement routing, caching and context compression. Typical result: 50–70% cost reduction in 2 weeks. Get in touch — I start with an analysis of your API logs.


