OpenAI API cost optimisation: caching, routing and context compression
How to cut your OpenAI API bill by 60–80% without reducing response quality — prompt caching, model routing and conversation history compression.
The AI project is live, tests passed, production deployment done, and after one month the OpenAI API bill is four times the budgeted amount. This is not a rare case. GPT-4o costs 16x more than GPT-4o mini for the same task, and an unoptimised agent context grows with every step. The three techniques below (prompt caching, model routing and context compression) reduce costs by 60–80% without degrading response quality.
Anatomy of Cost — What Generates the Bills
Every API call is billed in two parts: input tokens (prompt + history + system message + tool definitions) and output tokens (the model's response). Output tokens cost 4x as much as input tokens for every model in the price table below, but there are usually far fewer of them. Cost optimisation therefore starts with input tokens, which account for 70–85% of a typical call's cost.
OpenAI Model Prices (2026)
The four most commonly used production models differ in price by up to two orders of magnitude (from $0.15 to $15.00 per 1M input tokens). Choosing the right one for each task is the biggest cost lever:
| Model | Input (per 1M) | Output (per 1M) | Cached input | When to use |
|---|---|---|---|---|
| GPT-4o mini | $0.15 | $0.60 | $0.075 | Simple tasks — 60–70% of traffic |
| GPT-4o | $2.50 | $10.00 | $1.25 | Analysis, code, complex instructions |
| o3-mini | $1.10 | $4.40 | — | Maths, algorithms, reasoning |
| o1 | $15.00 | $60.00 | $7.50 | Hardest multi-step tasks |
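To make the table concrete, here is a minimal sketch of a per-call cost calculator. The prices are copied from the table above; the function name `call_cost` and the example token counts are illustrative assumptions, not part of any SDK.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached": 0.075},
    "gpt-4o": {"input": 2.50, "output": 10.00, "cached": 1.25},
    "o3-mini": {"input": 1.10, "output": 4.40, "cached": None},
    "o1": {"input": 15.00, "output": 60.00, "cached": 7.50},
}


def call_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Return the USD cost of one call.

    cached_tokens is the part of input_tokens that was served from the prompt cache.
    """
    price = PRICES[model]
    # If the table lists no cached price, bill cached tokens at the normal input rate.
    cached_price = price["cached"] if price["cached"] is not None else price["input"]
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * price["input"]
        + cached_tokens * cached_price
        + output_tokens * price["output"]
    )
    return total / 1_000_000


# Hypothetical call: 3,000 input tokens and 300 output tokens.
gpt4o = call_cost("gpt-4o", 3000, 300)
mini = call_cost("gpt-4o-mini", 3000, 300)
```

With these example numbers the input tokens make up about 71% of the GPT-4o call's cost, which is why the rest of the article concentrates on the input side.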
What Goes Into Input Tokens
In a typical agentic application, most input cost comes from elements sent on every call — not from the user's message:
- System message — 200–500 tokens, constant for every call
- Tool definitions — 100–300 tokens per tool, always sent
- Conversation history — grows linearly, 500–2000 tok/turn after a few agent steps
- RAG context — 500–4000 tokens depending on retrieval
- User message — 20–200 tokens, usually the smallest part
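A quick way to see where the input tokens go is to compute each component's share. The sketch below uses hypothetical token counts picked from within the ranges listed above (five tools at 200 tokens each, a history of 5,000 tokens); measure your own traffic before drawing conclusions.

```python
def input_breakdown(components: dict[str, int]) -> dict[str, float]:
    """Return each component's share of the input tokens, in percent."""
    total = sum(components.values())
    return {name: round(100 * tokens / total, 1) for name, tokens in components.items()}


# Hypothetical agent call; every number here is an assumption for illustration.
example = {
    "system_message": 350,
    "tool_definitions": 1000,
    "conversation_history": 5000,
    "rag_context": 2000,
    "user_message": 100,
}
shares = input_breakdown(example)
```

In this example the user message is about 1% of the input, while the conversation history is close to 60%.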
[Figure: API call cost optimisation. Cost per 1,000 calls with GPT-4o mini, before and after optimisation.]
Prompt Caching — The Biggest Quick Win
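Prompt caching cuts the price of a repeated prompt prefix. For the OpenAI models in the price table above, cached input tokens are billed at half the normal input rate, a 50% saving on the cached part; some other models and providers discount more. It is available in OpenAI (from GPT-4o) and Anthropic (from Claude 3.5). On OpenAI the cache is created automatically when the same prefix has 1024+ tokens and appears within about 5 minutes of the previous call; Anthropic requires marking the cacheable prefix explicitly with `cache_control`. On OpenAI no configuration is needed: just keep constant elements at the top of the prompt.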
How to Structure Your Prompt for Caching
The order of elements in the prompt determines cache hit rate. Correct order — from most constant to most frequently changing:
1. System message: constant across the application, always cached
2. Tool definitions: constant, cached together with the system message
3. Context documents: constant per user session, cached
4. Conversation history: changes with every turn, so at most its unchanged prefix is cached
5. User message: always new, not cached
Common mistake: putting the date, session ID or user name in the system message. This breaks the cache for every user and session — move those to the user message or a separate context block after the conversation history.
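The mistake above can be avoided with a small message builder. This is a minimal sketch: `build_messages`, the `[Context]` line format and the shortened `SYSTEM_PROMPT` are assumptions for illustration.

```python
from datetime import date

# Static text only: no dates, session IDs or user names in here.
SYSTEM_PROMPT = "You are a customer support assistant for Acme Corp."


def build_messages(history: list[dict], user_message: str, user_name: str, today: date) -> list[dict]:
    """Order messages from most constant to most volatile."""
    # Volatile values travel in the final user message, so the prefix
    # (system message + earlier history) stays identical across users and days.
    context = f"[Context] user: {user_name}; date: {today.isoformat()}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": f"{context}\n\n{user_message}"},
    ]


a = build_messages([], "Where is my order?", "Ann", date(2026, 1, 1))
b = build_messages([], "Where is my order?", "Bob", date(2026, 1, 2))
```

Both calls share an identical first message, so a cached prefix can be reused even though the user and the date differ.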
```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a customer support assistant for Acme Corp.
Rules: only answer questions about Acme products, escalate to a human when the customer is unhappy or asks for a refund > $500.
[...2000 tokens of product descriptions, procedures and pricing: constant text, always cached...]"""


def chat(history: list, user_message: str) -> tuple[str, dict]:
    # Constant system prompt first, volatile content last, so the prefix can be cached.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": user_message},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

    # cached_tokens reports how many prompt tokens were served from the cache.
    details = getattr(resp.usage, "prompt_tokens_details", None)
    cached_tokens = details.cached_tokens if details else 0

    return resp.choices[0].message.content, {
        "total_tokens": resp.usage.total_tokens,
        "cached_tokens": cached_tokens,
        # GPT-4o mini: each cached token saves $0.15 - $0.075 = $0.075 per 1M.
        "saved_usd": round(cached_tokens * 0.075 / 1_000_000, 6),
    }
```
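Cache savings in practice: an application handling 10,000 calls/day with a 2,000-token system message resends 20M prefix tokens per day. At GPT-4o prices every 1M cached tokens saves $1.25, so caching alone saves up to $25/day, or ~$22/day at a hit rate of around 90%; on GPT-4o mini the same traffic saves at most about $1.50/day. The cached token count is available in `resp.usage.prompt_tokens_details.cached_tokens`: log it and monitor hit rate over time.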
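The arithmetic behind a daily-saving estimate like this can be sketched as follows. The prices come from the table earlier in the article; the 90% hit rate is an assumption, and the function name is illustrative.

```python
def daily_cache_saving(
    calls_per_day: int,
    prefix_tokens: int,
    hit_rate: float,
    input_price: float,
    cached_price: float,
) -> float:
    """USD saved per day when a share of prefix tokens is billed at the cached rate.

    Prices are in USD per 1M tokens; hit_rate is between 0 and 1.
    """
    cached_tokens = calls_per_day * prefix_tokens * hit_rate
    return cached_tokens * (input_price - cached_price) / 1_000_000


# 10,000 calls/day, 2,000-token prefix, assumed 90% hit rate.
gpt4o_saving = daily_cache_saving(10_000, 2_000, 0.9, input_price=2.50, cached_price=1.25)
mini_saving = daily_cache_saving(10_000, 2_000, 0.9, input_price=0.15, cached_price=0.075)
```

Under these assumptions the saving is about $22.50/day on GPT-4o and about $1.35/day on GPT-4o mini, so the absolute benefit depends heavily on which model carries the traffic.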
Query Routing — The Right Model for the Right Task
60–70% of queries in typical business applications don't need full GPT-4o. A router classifies the query and directs it to the cheapest model that can handle the task. The classifier cost — one GPT-4o mini call with ~20 output tokens — pays for itself immediately when most traffic goes to a model that's 16x cheaper.
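Whether the router pays for itself can be checked with a back-of-the-envelope calculation. The sketch below assumes a hypothetical query of 1,500 input and 300 output tokens, a router prompt of about 150 input tokens, and a 65% share of traffic going to GPT-4o mini; only the prices come from the table earlier in the article.

```python
# (input, output) prices in USD per 1M tokens, from the price table.
PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}


def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single call, ignoring caching."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def routed_cost(
    mini_share: float,
    input_tokens: int,
    output_tokens: int,
    router_input_tokens: int = 150,
    router_output_tokens: int = 20,
) -> float:
    """Average cost per query when a GPT-4o mini router runs before every call."""
    router = cost("gpt-4o-mini", router_input_tokens, router_output_tokens)
    answer = (
        mini_share * cost("gpt-4o-mini", input_tokens, output_tokens)
        + (1 - mini_share) * cost("gpt-4o", input_tokens, output_tokens)
    )
    return router + answer


baseline = cost("gpt-4o", 1500, 300)   # everything on GPT-4o
with_router = routed_cost(0.65, 1500, 300)
saving = 1 - with_router / baseline
```

With these assumed numbers the router call costs well under 1% of the baseline, and the average cost per query falls by roughly 60%.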
When GPT-4o mini (60–70% of traffic)
- Text classification — sentiment, category, user intent
- Simple FAQ answers from supplied context
- Data extraction from structured texts (structured output)
- Summarising short documents — up to 1,000 words
- Translation without specialist terminology
When GPT-4o (25–35% of traffic)
- Analysis of complex documents — contracts, financial reports, specifications
- Code generation and code review
- Tasks with multiple tools and dependencies between calls
- Responses where quality directly affects revenue or customer decisions
- Content generation requiring coherence and creativity
When o3-mini or o1 (5–10% of traffic)
- Mathematical proofs and algorithmic tasks
- Complex SQL generation, scripts or system architecture
- Risk analysis requiring multi-step reasoning
- Tasks where the cost of an error far exceeds the cost of a token
```python
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()                        # plain client for the final answer
ic = instructor.from_openai(OpenAI())    # instructor client for structured routing output


class RouteDecision(BaseModel):
    model: Literal["gpt-4o-mini", "gpt-4o", "o3-mini"]
    reason: str


def classify_query(query: str) -> RouteDecision:
    # One cheap GPT-4o mini call that returns a validated RouteDecision.
    return ic.chat.completions.create(
        model="gpt-4o-mini",
        response_model=RouteDecision,
        messages=[
            {
                "role": "system",
                "content": "Decide which OpenAI model to use. "
                "gpt-4o-mini: simple questions, classification, extraction, FAQ. "
                "gpt-4o: analysis, code, complex reasoning. "
                "o3-mini: maths, algorithms, multi-step reasoning.",
            },
            {"role": "user", "content": f"Query: {query}"},
        ],
    )


def smart_chat(query: str, history: list) -> str:
    # Route first, then answer with the selected model.
    route = classify_query(query)
    resp = client.chat.completions.create(
        model=route.model,
        messages=[*history, {"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```
Important: the router is an extra API call — it adds 50–100ms of latency. If latency is critical, run routing in parallel with the first LLM call, and cache routing decisions per query class (e.g. all "order status questions" → always mini).
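Caching routing decisions per query class can be as simple as a lookup that runs before the LLM router. The sketch below is one possible approach, not the only one: the class names, keyword rules and the `route` helper are all hypothetical, and the `classify` argument stands in for an LLM-based classifier.

```python
from typing import Callable, Optional

# Hypothetical fixed routes for query classes that never need a bigger model.
ROUTE_BY_CLASS = {
    "order_status": "gpt-4o-mini",
    "opening_hours": "gpt-4o-mini",
    "contract_review": "gpt-4o",
}

# Hypothetical keyword rules; in production these might come from logged router decisions.
KEYWORDS = {
    "order_status": ("where is my order", "order status", "tracking number"),
    "opening_hours": ("opening hours", "are you open"),
    "contract_review": ("review this contract",),
}


def detect_class(query: str) -> Optional[str]:
    """Return a known query class, or None when no rule matches."""
    text = query.lower()
    for query_class, phrases in KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return query_class
    return None


def route(query: str, classify: Callable[[str], str]) -> str:
    """Use the fixed route when the class is known; otherwise call the LLM router."""
    query_class = detect_class(query)
    if query_class is not None:
        return ROUTE_BY_CLASS[query_class]  # no extra API call, no added latency
    return classify(query)
```

Queries that match a known class skip the router call entirely; everything else falls through to the classifier.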
[Figure: Model routing strategy. Query routing: traffic distribution in a typical B2B application.]
Context Compression — Stop the Growing Bill
In agents and chatbots, conversation history grows with every step. After 20 turns it can reach 8,000+ tokens sent on every call. Three strategies, from simplest to most precise:
Strategy 1: Sliding window
Keep only the last N messages. Pros: zero extra API calls, two lines of code. Cons: loss of long-term memory — the model doesn't remember early conversation context. Good for chatbots where context older than 10 turns rarely matters.
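A minimal sketch of the sliding window, assuming the history is a list of chat messages in the usual `{"role": ..., "content": ...}` form. Leading system messages are kept, because dropping them would change the model's instructions. If the history contains tool calls, take care not to cut between a tool call and its result.

```python
def sliding_window(history: list[dict], keep_last: int = 10) -> list[dict]:
    """Keep leading system messages plus the last keep_last other messages."""
    # Collect system messages at the very start of the history.
    head = []
    for message in history:
        if message["role"] != "system":
            break
        head.append(message)

    tail = history[len(head):]
    # Guard against keep_last=0, where tail[-0:] would return everything.
    recent = tail[-keep_last:] if keep_last > 0 else []
    return head + recent
```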
Strategy 2: Summarisation
Every N turns, compress old context into a 300-token summary using GPT-4o mini. Compression cost: ~$0.002 per session. Saving: 60–70% of history tokens while preserving key facts from the whole conversation. Good for long sessions where users return to earlier topics.
Strategy 3: Semantic pruning
Score the semantic distance of each historical message from the current query (cosine similarity of embeddings). Remove messages above the distance threshold. Most precise — keeps the most relevant fragments without lossy compression. Worth implementing for sessions of 50+ turns with a wide variety of topics.
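A minimal sketch of semantic pruning, assuming the embeddings have already been computed (for example with an embeddings API) and are passed in as plain lists of floats. The `max_distance` threshold and the choice to always keep the last few messages are assumptions to tune on your own data.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_history(
    history: list[dict],
    embeddings: list[list[float]],
    query_embedding: list[float],
    max_distance: float = 0.6,
    keep_last: int = 4,
) -> list[dict]:
    """Drop old messages whose cosine distance to the query exceeds max_distance.

    embeddings[i] is the embedding of history[i]. The last keep_last messages
    are always kept so the immediate conversational context survives.
    """
    cutoff = len(history) - keep_last
    kept = []
    for index, (message, embedding) in enumerate(zip(history, embeddings)):
        distance = 1.0 - cosine_similarity(embedding, query_embedding)
        if index >= cutoff or distance <= max_distance:
            kept.append(message)
    return kept
```

Message order is preserved, so the pruned list can be sent to the model as-is.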
Strategy 2 (summarisation) in code:

```python
from openai import OpenAI

client = OpenAI()


def compress_history(history: list[dict], keep_last: int = 6, max_tokens: int = 300) -> list[dict]:
    # Nothing to compress while the history is still short.
    if len(history) <= keep_last:
        return history

    old_msgs = history[:-keep_last]
    recent = history[-keep_last:]

    # One message per line, so speaker turns do not run together in the summary prompt.
    old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old_msgs)

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Compress this conversation to max {max_tokens} tokens. "
                "Keep key facts, decisions and context. Remove greetings and small talk.",
            },
            {"role": "user", "content": old_text},
        ],
    )
    summary = resp.choices[0].message.content

    # Replace the old messages with a single summary message, keep recent ones verbatim.
    return [
        {"role": "system", "content": f"[Earlier conversation summary]: {summary}"},
        *recent,
    ]
```
In practice: compression every 10 turns reduces context size by 65–70% while maintaining 95%+ response accuracy. To verify this on your own traffic, measure ROUGE-L between responses generated with the full history and responses generated with the compressed history. If it drops below 0.9, increase `max_tokens` or increase `keep_last`.
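For the ROUGE-L check, a dependency-free sketch is shown below. It computes the F1 variant over lowercased, whitespace-split tokens; established ROUGE packages add stemming and other normalisation, so treat the absolute scores as approximate.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one row of DP state."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference answer and a candidate answer."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Here `reference` would be the response produced with the full history and `candidate` the response produced with the compressed history, averaged over a sample of real conversations.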
Checklist — 10 Quick Wins
Implement in this order — each point is an independent optimisation:
1. Check the usage dashboard: find the 20% of calls generating 80% of costs
2. Move constant elements to the top of the prompt: caching activates automatically
3. Limit `max_tokens`: set a sensible cap instead of leaving the default
4. Replace GPT-4o with mini wherever possible: run an A/B quality test first
5. Shorten the system message: remove repetition and instructions that don't change results
6. Remove unused tool definitions from calls where tools aren't needed
7. Implement sliding window for conversation history (`keep_last=10`)
8. Add model routing for tasks that can be classified
9. Batch API for non-interactive tasks: 50% cheaper, asynchronous processing
10. Per-endpoint monitoring: alert when cost-per-call exceeds a set threshold
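For the Batch API item, requests are submitted as a JSONL file with one request per line. The sketch below only builds that file content; the `task-{i}` IDs, the default model and the `max_tokens` cap are illustrative assumptions.

```python
import json


def build_batch_lines(prompts: list[str], model: str = "gpt-4o-mini", max_tokens: int = 300) -> str:
    """Return JSONL text with one chat completion request per prompt."""
    lines = []
    for index, prompt in enumerate(prompts):
        request = {
            "custom_id": f"task-{index}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


jsonl = build_batch_lines(["Summarise invoice A", "Summarise invoice B"])
```

The resulting text would then be saved to a `.jsonl` file, uploaded with `client.files.create(..., purpose="batch")` and submitted with `client.batches.create(...)` using the `/v1/chat/completions` endpoint and a `24h` completion window; check the current Batch API documentation for limits before relying on this.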
| Optimisation | Difficulty | Saving | Implementation time |
|---|---|---|---|
| Prompt caching | Low | 40–75% of input cost | 2h |
| Limit max_tokens | Low | 10–30% | 30 min |
| Routing — mini for simple tasks | Medium | 80–90% on those calls | 1–2 days (A/B test) |
| Sliding window history | Low | 20–50% on long sessions | 2h |
| Full model routing | Medium | 50–70% of total cost | 2–3 days |
| Batch API | Low | 50% on offline tasks | 3h |
---
I run AI cost audits for companies — I analyse API call logs, identify optimisation opportunities and implement routing, caching and context compression. Typical result: 50–70% cost reduction in 2 weeks. Get in touch — I start with an analysis of your API logs.
