AI & Automation 13 min

OpenAI API cost optimisation: caching, routing and context compression

How to cut your OpenAI API bill by 60–80% without reducing response quality — prompt caching, model routing and conversation history compression.

The AI project is live, tests passed, production deployment done, and after one month the OpenAI API bill is four times the budgeted amount. This is not a rare case. GPT-4o costs roughly 16x more than GPT-4o mini for the same task ($2.50 vs $0.15 per 1M input tokens), and an unoptimised agent context grows with every step. The techniques below (prompt caching, model routing and context compression) reduce costs by 60–80% without any degradation in response quality.

Anatomy of Cost — What Generates the Bills

Every API call has two components: input tokens (prompt + history + system message + tool definitions) and output tokens (model response). Output tokens cost 4x more than input tokens at the prices listed below, but there are usually far fewer of them. Cost optimisation therefore starts with input tokens, which account for 70–85% of a typical call's cost.

OpenAI Model Prices (2026)

The four most commonly used production models differ in price by an order of magnitude. Choosing the right one for each task is the biggest cost lever:

| Model | Input (per 1M) | Output (per 1M) | Cached input (per 1M) | When to use |
|---|---|---|---|---|
| GPT-4o mini | $0.15 | $0.60 | $0.075 | Simple tasks, 60–70% of traffic |
| GPT-4o | $2.50 | $10.00 | $1.25 | Analysis, code, complex instructions |
| o3-mini | $1.10 | $4.40 | | Maths, algorithms, reasoning |
| o1 | $15.00 | $60.00 | $7.50 | Hardest multi-step tasks |

What Goes Into Input Tokens

In a typical agentic application, most input cost comes from elements sent on every call — not from the user's message:

  • System message — 200–500 tokens, constant for every call
  • Tool definitions — 100–300 tokens per tool, always sent
  • Conversation history — grows linearly, 500–2000 tok/turn after a few agent steps
  • RAG context — 500–4000 tokens depending on retrieval
  • User message — 20–200 tokens, usually the smallest part
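To see how these components add up, the cost of a single call can be broken down per component. The sketch below is illustrative only: the prices are copied from the table above, while the function name `call_cost` and the token counts in the example are assumptions chosen from within the ranges listed, not measurements.

```python
# Prices in USD per 1M tokens, copied from the price table above.
INPUT_PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "o3-mini": 1.10, "o1": 15.00}
OUTPUT_PRICE_PER_1M = {"gpt-4o-mini": 0.60, "gpt-4o": 10.00, "o3-mini": 4.40, "o1": 60.00}


def call_cost(model: str, input_parts: dict[str, int], output_tokens: int) -> dict:
    """Split the cost of one call into per-component input cost and output cost."""
    in_price = INPUT_PRICE_PER_1M[model] / 1_000_000
    out_price = OUTPUT_PRICE_PER_1M[model] / 1_000_000

    breakdown = {name: tokens * in_price for name, tokens in input_parts.items()}
    input_cost = sum(breakdown.values())
    output_cost = output_tokens * out_price
    total = input_cost + output_cost

    return {
        "breakdown": breakdown,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "input_share": input_cost / total if total else 0.0,
    }


# Hypothetical agent call: token counts are assumed, picked from the ranges above.
parts = {"system": 400, "tools": 600, "history": 1500, "rag": 2000, "user": 100}
result = call_cost("gpt-4o", parts, output_tokens=300)
print(f"input share of total cost: {result['input_share']:.0%}")
```

With these assumed numbers the user message is the cheapest line item, and history plus RAG context dominate the bill.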

/// API CALL COST OPTIMISATION

Cost per 1,000 calls on GPT-4o mini, before and after optimisation ($0.15 per 1M input tokens, $0.075 per 1M cached input tokens, assuming every call hits the cache):

| Component | Before | After |
|---|---|---|
| System message (2,000 tok) | $0.300 | $0.150 (cached) |
| Tool definitions (400 tok) | $0.060 | $0.030 (cached) |
| Conversation history (1,500 tok, compressed to 600 tok) | $0.225 | $0.090 |
| User message (100 tok) | $0.015 | $0.015 |
| Total per 1,000 calls | $0.600 | $0.285 |

Cost reduction: 52.5%. Cached tokens: 50% cheaper. GPT-4o mini vs GPT-4o: roughly 16x price difference.

Prompt Caching — The Biggest Quick Win

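Prompt caching halves the cost of repeated prompt prefixes on the models listed above: cached input is billed at 50% of the regular input price. It is available in OpenAI (from GPT-4o) and Anthropic (from Claude 3.5). On OpenAI the cache is created automatically when a prompt is at least 1,024 tokens long and the same prefix appears within about 5 minutes of the previous call. No configuration is needed there: just keep constant elements at the top of the prompt. Anthropic requires marking cacheable blocks explicitly with `cache_control`.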

How to Structure Your Prompt for Caching

The order of elements in the prompt determines cache hit rate. Correct order — from most constant to most frequently changing:

  1. System message: constant across the application, always cached
  2. Tool definitions: constant, cached together with the system message
  3. Context documents: constant per user session, cached
  4. Conversation history: changes with every turn, not cached
  5. User message: always new, not cached

Common mistake: putting the date, session ID or user name in the system message. This breaks the cache for every user and session — move those to the user message or a separate context block after the conversation history.
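One way to avoid that mistake is to build the message list so that everything volatile lands at the very end. The following is a minimal sketch, not a prescribed layout: `build_messages`, `shared_prefix_len` and the `session_context` keys are hypothetical names introduced here for illustration.

```python
SYSTEM_PROMPT = "You are a customer support assistant for Acme Corp."  # constant text only


def build_messages(history: list[dict], user_message: str, session_context: dict) -> list[dict]:
    """Stable prefix first, per-session data last, so the prefix stays identical across users."""
    dynamic_block = (
        f"[Session context] date={session_context['date']} "
        f"user={session_context['user_name']} session={session_context['session_id']}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # never contains dates, IDs or names
        *history,
        {"role": "user", "content": f"{dynamic_block}\n\n{user_message}"},
    ]


def shared_prefix_len(a: list[dict], b: list[dict]) -> int:
    """Number of leading messages that are identical in both lists."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


ann = build_messages([], "Where is my order?", {"date": "2026-01-05", "user_name": "Ann", "session_id": "s1"})
bob = build_messages([], "Where is my order?", {"date": "2026-01-05", "user_name": "Bob", "session_id": "s2"})
print(shared_prefix_len(ann, bob))  # the system message is shared; only the last message differs
```

A check like `shared_prefix_len` can run in a unit test to catch a regression where someone adds a timestamp to the system prompt.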

prompt_caching.py
```python
from openai import OpenAI

client = OpenAI()

# Constant text goes first so every call shares the same cacheable prefix.
SYSTEM_PROMPT = """You are a customer support assistant for Acme Corp.
Rules: only answer questions about Acme products, escalate to a human when the customer is unhappy or asks for a refund > $500.
[...2000 tokens of product descriptions, procedures and pricing, constant text, always cached...]"""

# GPT-4o mini: $0.15 regular vs $0.075 cached input per 1M tokens,
# so each cached token saves $0.075 per 1M.
SAVED_USD_PER_CACHED_TOKEN = 0.075 / 1_000_000


def chat(history: list, user_message: str) -> tuple[str, dict]:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": user_message},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)

    # prompt_tokens_details may be absent, so read it defensively.
    details = getattr(resp.usage, "prompt_tokens_details", None)
    cached_tokens = (details.cached_tokens or 0) if details else 0

    return resp.choices[0].message.content, {
        "total_tokens": resp.usage.total_tokens,
        "cached_tokens": cached_tokens,
        "saved_usd": round(cached_tokens * SAVED_USD_PER_CACHED_TOKEN, 6),
    }
```

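Cache hit rate in practice: an application handling 10,000 calls/day with a 2,000-token system message sends 20M cacheable tokens per day. On GPT-4o, where each cached token saves $1.25 per 1M, that is ~$22/day from caching alone, assuming a hit rate of about 90%. On GPT-4o mini the same traffic saves about $1.35/day. The cached token count is available in `resp.usage.prompt_tokens_details.cached_tokens`: log it and monitor hit rate over time.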
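Monitoring hit rate can start with a small accumulator fed from each response's usage data. This is a sketch under assumptions: `CacheStats` is a hypothetical helper, and it defines hit rate as the token-weighted share of prompt tokens served from cache, which is one reasonable definition among several.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals of prompt tokens and cached prompt tokens."""
    prompt_tokens: int = 0
    cached_tokens: int = 0

    def record(self, prompt_tokens: int, cached_tokens: int) -> None:
        # Call once per API response with the values from resp.usage.
        self.prompt_tokens += prompt_tokens
        self.cached_tokens += cached_tokens

    @property
    def hit_rate(self) -> float:
        # Share of all prompt tokens that were billed at the cached rate.
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0

    def saved_usd(self, input_price_per_1m: float, cached_price_per_1m: float) -> float:
        # Each cached token saves the difference between the two rates.
        return self.cached_tokens * (input_price_per_1m - cached_price_per_1m) / 1_000_000


stats = CacheStats()
stats.record(prompt_tokens=2500, cached_tokens=2048)  # warm call, prefix found in cache
stats.record(prompt_tokens=2600, cached_tokens=0)     # cold call, e.g. after a quiet period
print(f"hit rate: {stats.hit_rate:.1%}")
```

Exporting `hit_rate` per endpoint makes a sudden drop visible, which usually points at a change in the prompt prefix.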

Query Routing — The Right Model for the Right Task

60–70% of queries in typical business applications don't need full GPT-4o. A router classifies the query and directs it to the cheapest model that can handle the task. The classifier cost — one GPT-4o mini call with ~20 output tokens — pays for itself immediately when most traffic goes to a model that's 16x cheaper.

When GPT-4o mini (60–70% of traffic)

  • Text classification — sentiment, category, user intent
  • Simple FAQ answers from supplied context
  • Data extraction from structured texts (structured output)
  • Summarising short documents — up to 1,000 words
  • Translation without specialist terminology

When GPT-4o (25–35% of traffic)

  • Analysis of complex documents — contracts, financial reports, specifications
  • Code generation and code review
  • Tasks with multiple tools and dependencies between calls
  • Responses where quality directly affects revenue or customer decisions
  • Content generation requiring coherence and creativity

When o3-mini or o1 (5–10% of traffic)

  • Mathematical proofs and algorithmic tasks
  • Complex SQL generation, scripts or system architecture
  • Risk analysis requiring multi-step reasoning
  • Tasks where the cost of an error far exceeds the cost of a token

model_router.py
```python
from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()
# instructor wraps a client so that responses are parsed into Pydantic models.
ic = instructor.from_openai(OpenAI())


class RouteDecision(BaseModel):
    model: Literal["gpt-4o-mini", "gpt-4o", "o3-mini"]
    reason: str


def classify_query(query: str) -> RouteDecision:
    """One cheap GPT-4o mini call that picks the target model."""
    return ic.chat.completions.create(
        model="gpt-4o-mini",
        response_model=RouteDecision,
        messages=[
            {"role": "system", "content": "Decide which OpenAI model to use. gpt-4o-mini: simple questions, classification, extraction, FAQ. gpt-4o: analysis, code, complex reasoning. o3-mini: maths, algorithms, multi-step reasoning."},
            {"role": "user", "content": f"Query: {query}"},
        ],
    )


def smart_chat(query: str, history: list) -> str:
    # Route first, then send the actual request to the chosen model.
    route = classify_query(query)
    resp = client.chat.completions.create(
        model=route.model,
        messages=[*history, {"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```

Important: the router is an extra API call — it adds 50–100ms of latency. If latency is critical, run routing in parallel with the first LLM call, and cache routing decisions per query class (e.g. all "order status questions" → always mini).
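Caching routing decisions per query class could look like the sketch below. It is deliberately generic: `CachedRouter` and `keyword_intent` are hypothetical names, the keyword matcher is a stand-in for whatever cheap intent detection the application already has, and `classify` can be any function that returns a model name (for example a wrapper around the LLM classifier).

```python
from typing import Callable, Optional


def keyword_intent(query: str) -> Optional[str]:
    """Toy intent detector: returns a query class, or None when the class is unknown."""
    q = query.lower()
    if "order" in q and "status" in q:
        return "order_status"
    return None


class CachedRouter:
    """Remembers the routing decision for each query class to skip repeat classifier calls."""

    def __init__(self, intent_of: Callable[[str], Optional[str]], classify: Callable[[str], str]):
        self.intent_of = intent_of
        self.classify = classify
        self._cache: dict[str, str] = {}

    def route(self, query: str) -> str:
        intent = self.intent_of(query)
        if intent is None:
            # Unknown class: pay for the classifier every time.
            return self.classify(query)
        if intent not in self._cache:
            self._cache[intent] = self.classify(query)
        return self._cache[intent]


classifier_calls = []


def fake_classify(query: str) -> str:
    # Stand-in for the LLM classifier; records each call so the saving is visible.
    classifier_calls.append(query)
    return "gpt-4o-mini" if "order" in query.lower() else "gpt-4o"


router = CachedRouter(keyword_intent, fake_classify)
router.route("What is the status of my order 123?")
router.route("Order status for 456 please")  # served from the cache, no classifier call
print(len(classifier_calls))
```

The trade-off is that one misclassified query fixes the route for its whole class, so cached classes are worth reviewing by hand.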

/// MODEL ROUTING STRATEGY

Query routing: traffic split in a typical B2B application

Flow: input query → classifier (GPT-4o mini, ~20 output tokens) → route to model

| Share of traffic | Task type | Model | Input price (per 1M tok) | Examples |
|---|---|---|---|---|
| 65% | Simple tasks | GPT-4o mini | $0.15 | FAQ, classification, extraction, summarising short texts |
| 30% | Complex tasks | GPT-4o | $2.50 | Document analysis, code generation, multi-step instructions |
| 5% | Reasoning | o3-mini | $1.10 | Maths, algorithms, complex code generation |

Average input cost with routing: $0.90 per 1M tokens (0.65 × $0.15 + 0.30 × $2.50 + 0.05 × $1.10), which is 64% below the flat $2.50 of GPT-4o without routing. Classifier overhead: 50–100 ms per query.
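The blended price behind a routing split is a weighted average of the per-model input prices. The sketch below works it out for the 65/30/5 split shown above, using the input prices from the price table. It is a simplification that assumes every query class carries the same number of tokens and ignores both output tokens and the classifier's own cost; `blended_input_price` is a name introduced here.

```python
# Input prices in USD per 1M tokens, from the price table.
INPUT_PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "o3-mini": 1.10}


def blended_input_price(traffic_share: dict[str, float]) -> float:
    """Weighted average input price for a given traffic split (shares must sum to 1)."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * INPUT_PRICE_PER_1M[model] for model, share in traffic_share.items())


split = {"gpt-4o-mini": 0.65, "gpt-4o": 0.30, "o3-mini": 0.05}
blended = blended_input_price(split)
saving = 1 - blended / INPUT_PRICE_PER_1M["gpt-4o"]
print(f"${blended:.4f} per 1M input tokens, {saving:.0%} below flat GPT-4o")
```

Because GPT-4o still handles 30% of traffic, it contributes most of the blended price, so shifting even a few points from GPT-4o to mini moves the average noticeably.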

Context Compression — Stop the Growing Bill

In agents and chatbots, conversation history grows with every step. After 20 turns it can reach 8,000+ tokens sent on every call. Three strategies, from simplest to most precise:

Strategy 1: Sliding window

Keep only the last N messages. Pros: zero extra API calls, two lines of code. Cons: loss of long-term memory — the model doesn't remember early conversation context. Good for chatbots where context older than 10 turns rarely matters.
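A minimal sketch of the sliding window follows. It makes one assumption beyond the text: system messages are kept regardless of age, since dropping them would also drop the instructions. The function name `sliding_window` is illustrative.

```python
def sliding_window(history: list[dict], keep_last: int = 10) -> list[dict]:
    """Keep all system messages plus the last keep_last non-system messages."""
    system_msgs = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # Guard against keep_last == 0, because rest[-0:] would return the whole list.
    return system_msgs + rest[-keep_last:] if keep_last > 0 else system_msgs


history = [{"role": "system", "content": "You are a support assistant."}]
for turn in range(15):
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

trimmed = sliding_window(history, keep_last=4)
print(len(history), "->", len(trimmed))
```

In agent loops that use tools, a cut can separate a tool call from its result, so it may be safer to trim on turn boundaries rather than on raw message counts.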

Strategy 2: Summarisation

Every N turns, compress old context into a 300-token summary using GPT-4o mini. Compression cost: ~$0.002 per session. Saving: 60–70% of history tokens while preserving key facts from the whole conversation. Good for long sessions where users return to earlier topics.

Strategy 3: Semantic pruning

Score the semantic distance of each historical message from the current query (cosine similarity of embeddings). Remove messages above the distance threshold. Most precise — keeps the most relevant fragments without lossy compression. Worth implementing for sessions of 50+ turns with a wide variety of topics.
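The pruning step can be sketched as follows. The embedding function is passed in as a parameter, so `embed` here is an assumption standing in for whichever embedding model the application uses; `prune_history`, `max_distance` and the toy embedder are likewise illustrative, and the threshold would need tuning on real conversations.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_history(
    history: list[dict],
    query: str,
    embed: Callable[[str], Sequence[float]],
    max_distance: float = 0.6,
    keep_last: int = 2,
) -> list[dict]:
    """Drop older messages whose cosine distance to the query exceeds max_distance."""
    query_vec = embed(query)
    cutoff = max(len(history) - keep_last, 0)  # the newest keep_last messages are always kept
    kept = []
    for index, msg in enumerate(history):
        if index >= cutoff:
            kept.append(msg)
            continue
        distance = 1.0 - cosine_similarity(embed(msg["content"]), query_vec)
        if distance <= max_distance:
            kept.append(msg)
    return kept


def toy_embed(text: str) -> list[float]:
    # Two-dimensional stand-in for a real embedding model: one axis per topic keyword.
    lowered = text.lower()
    return [float("invoice" in lowered), float("weather" in lowered)]


conversation = [
    {"role": "user", "content": "My invoice is wrong"},
    {"role": "assistant", "content": "Nice weather today"},
    {"role": "user", "content": "ok"},
]
pruned = prune_history(conversation, "Please resend the invoice", toy_embed, keep_last=1)
print([m["content"] for m in pruned])
```

In production the message embeddings would be computed once and stored, so that only the query needs a fresh embedding on each call.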

context_compression.py
```python
from openai import OpenAI

client = OpenAI()


def compress_history(history: list[dict], keep_last: int = 6, max_tokens: int = 300) -> list[dict]:
    # Short conversations are sent as-is.
    if len(history) <= keep_last:
        return history

    old_msgs = history[:-keep_last]
    recent = history[-keep_last:]
    # One message per line so the summariser can tell the turns apart.
    old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old_msgs)

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Compress this conversation to max {max_tokens} tokens. Keep key facts, decisions and context. Remove greetings and small talk."},
            {"role": "user", "content": old_text},
        ],
    )
    summary = resp.choices[0].message.content

    # Old turns collapse into one summary message; recent turns stay verbatim.
    return [
        {"role": "system", "content": f"[Earlier conversation summary]: {summary}"},
        *recent,
    ]
```

In practice: compression every 10 turns reduces context size by 65–70% while maintaining 95%+ response accuracy. Measure ROUGE-L between the responses generated with full history and with compressed history. If it drops below 0.9, increase `max_tokens` or `keep_last` so that more of the original context survives.
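For a quick check without extra dependencies, ROUGE-L can be approximated in a few lines. The sketch below is a simplified variant: it tokenises on whitespace, lowercases the text and reports the F1 form of the score, so its values may differ slightly from a full ROUGE implementation. The function names are introduced here.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between two texts, based on whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


full_history_answer = "the cat sat on the mat"
compressed_answer = "the cat is on the mat"
score = rouge_l_f1(full_history_answer, compressed_answer)
print(f"ROUGE-L F1: {score:.3f}")
```

Running this over a sample of real sessions, with the full-history response as the reference, gives the number to compare against the 0.9 threshold.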

Checklist — 10 Quick Wins

Implement in this order — each point is an independent optimisation:

  1. Check the usage dashboard: find the 20% of calls generating 80% of costs
  2. Move constant elements to the top of the prompt: caching activates automatically
  3. Limit max_tokens: set a sensible cap instead of leaving the default
  4. Replace GPT-4o with mini wherever possible: run an A/B quality test first
  5. Shorten the system message: remove repetition and instructions that don't change results
  6. Remove unused tool definitions from calls where tools aren't needed
  7. Implement sliding window for conversation history (keep_last=10)
  8. Add model routing for tasks that can be classified
  9. Batch API for non-interactive tasks: 50% cheaper, asynchronous processing
  10. Per-endpoint monitoring: alert when cost-per-call exceeds a set threshold

| Optimisation | Difficulty | Saving | Implementation time |
|---|---|---|---|
| Prompt caching | Low | 40–50% of input cost | 2h |
| Limit max_tokens | Low | 10–30% | 30 min |
| Routing: mini for simple tasks | Medium | 80–90% on those calls | 1–2 days (A/B test) |
| Sliding window history | Low | 20–50% on long sessions | 2h |
| Full model routing | Medium | 50–70% of total cost | 2–3 days |
| Batch API | Low | 50% on offline tasks | 3h |
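For the Batch API item, most of the work is preparing the input file: one JSON object per line, each describing a single request. The sketch below only builds that file content and makes no network call. `build_batch_lines` and the example prompts are hypothetical, and the default model and token cap are assumptions.

```python
import json


def build_batch_lines(prompts: dict[str, str], model: str = "gpt-4o-mini", max_tokens: int = 300) -> str:
    """Build JSONL content for a batch of chat completion requests, one request per line."""
    lines = []
    for custom_id, prompt in prompts.items():
        request = {
            "custom_id": custom_id,  # used to match each result back to its input
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,  # cap output length, as in checklist item 3
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


jsonl = build_batch_lines({
    "ticket-1": "Classify the sentiment: 'Delivery was late again.'",
    "ticket-2": "Classify the sentiment: 'Great support, thank you!'",
})
print(jsonl.count("\n") + 1, "requests prepared")
```

The resulting text would then be saved to a file, uploaded with `client.files.create(..., purpose="batch")` and submitted with `client.batches.create(...)` using a 24-hour completion window. Results arrive asynchronously, which is why this only suits non-interactive workloads.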

---

I run AI cost audits for companies — I analyse API call logs, identify optimisation opportunities and implement routing, caching and context compression. Typical result: 50–70% cost reduction in 2 weeks. Get in touch — I start with an analysis of your API logs.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

Senior Full-Stack Engineer & AI Architect

8+ years building AI systems, automations, and scalable web applications that reduce costs and improve operational efficiency.
