OpenAI vs Anthropic vs Gemini vs Open-Source — How to Choose an LLM for Your Application
A practical guide to choosing the right LLM: when OpenAI, when Claude, when Gemini, and when open-source. With a comparison table, decision tree and multi-model router code.
You open the pricing pages of OpenAI, Anthropic and Google simultaneously and find that every provider claims to have "the best model in the world." GPT-4o, GPT-4o-mini, o3-mini, Claude Sonnet, Claude Haiku, Gemini 2.0 Flash, Gemini 1.5 Pro, Llama 3.3, Qwen 2.5, Mistral — the list grows faster than you can read, and every new model is announced as a breakthrough.
Good news: the decision doesn't have to be hard. No model is best at everything — each has its niche. Here's how to find the right one for your task without reading 50 benchmark reports.
| Criterion | OpenAI | Anthropic | Google Gemini | Open-source |
|---|---|---|---|---|
| Structured JSON | Native JSONSchema | Via SDK (Instructor) | Yes | Depends on model |
| Hallucination rate | Low | Lowest | Medium | High (7B), low (70B+) |
| Context window | 128k tokens | 200k tokens | 1M tokens (Gemini 1.5) | 128k tokens |
| Multimodal | Image | Image | Image + video + audio | Vision models |
| Input price per 1M tokens | $0.15–$2.50 | $0.25–$3.00 | $0.075–$1.25 | ~$0 (GPU CAPEX) |
| Fine-tuning | Paid | No public API | No | Full control |
| Self-hosting (GDPR data control) | No | No | No | Yes |
How NOT to think about model selection?
Most teams make the same mistakes before arriving at the right choice:
- Benchmark chasing — MMLU, HumanEval, MT-Bench measure general intelligence, not your use case. The model that wins benchmarks may lose on your data
- Single-provider loyalty — vendor lock-in is a financial and technical risk; none of the leaders has held the top position for more than a year
- One model for everything — like using a Ferrari for city commutes and a truck on the motorway; you need a fleet matched to the task
- Ignoring latency costs — a model 2× more expensive but 3× faster can have better total cost for a real-time application
- No tests on your own data — a 30-minute eval on 50 of your examples is worth more than a week reading comparison reports
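The last point can be sketched as a tiny evaluation harness. This is a minimal sketch, not a full eval framework: `keyword_stub` is a hypothetical stand-in for a real API call, the four examples are invented, and scoring is plain exact match. It also records p95 latency, since latency is one of the costs listed above.

```python
import time
from typing import Callable


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer maths
    return ordered[rank - 1]


def evaluate(model_fn: Callable[[str], str], examples: list[tuple[str, str]]) -> dict:
    """Run every (prompt, expected) pair through model_fn and score exact matches."""
    correct = 0
    latencies = []
    for prompt, expected in examples:
        start = time.perf_counter()
        answer = model_fn(prompt)
        latencies.append(time.perf_counter() - start)
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return {"accuracy": correct / len(examples), "p95_latency_s": p95(latencies)}


# Hypothetical stand-in for a real model call, so the sketch runs offline.
def keyword_stub(prompt: str) -> str:
    return "invoice" if "invoice" in prompt.lower() else "other"


# Invented examples; in practice use 30-50 items from your own data.
EXAMPLES = [
    ("Invoice #123 attached, please pay", "invoice"),
    ("Lunch on Friday?", "other"),
    ("Your invoice is overdue", "invoice"),
    ("Payment reminder for order 55", "invoice"),
]

report = evaluate(keyword_stub, EXAMPLES)
print(report)
```

Swapping `keyword_stub` for a function that calls each candidate model gives a like-for-like comparison on the same examples.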
OpenAI — ecosystem and structured outputs
OpenAI has built the most mature developer ecosystem: Assistants API, Batch API, native structured outputs with JSONSchema validation, and the longest-established function calling. Integrations are everywhere: LangChain, LlamaIndex, Instructor and DeepEval all treat OpenAI as a first-class citizen, as does virtually every other AI tool.
Where OpenAI leads:
- Structured outputs — native JSONSchema compliance guarantee, zero JSON structure errors. Not "almost always JSON" but truly always
- GPT-4o-mini — at $0.15/1M input tokens the cheapest model in the "good enough" class for 80% of tasks
- Batch API — 50% discount on asynchronous offline tasks: email classification, document analysis, overnight report generation
- Function calling and agents — the most mature and predictable API, ideal for multi-agent systems
- Reasoning — o3-mini — tasks requiring multi-step reasoning: maths, planning, logical analysis
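To make the structured-outputs point concrete, the sketch below builds the arguments for a Chat Completions call with a strict JSON schema. The invoice schema and the helper name `build_structured_request` are hypothetical; the `response_format` shape follows OpenAI's documented structured-outputs format as I understand it, and the network call itself is left as a comment so the example runs offline.

```python
import json


def build_structured_request(model: str, user_text: str) -> dict:
    """Build keyword arguments for chat.completions.create with a strict schema."""
    invoice_schema = {  # hypothetical schema, for illustration only
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        # Strict mode expects every property to be required
        # and additionalProperties to be false.
        "required": ["vendor", "total", "currency"],
        "additionalProperties": False,
    }
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Extract the invoice fields."},
            {"role": "user", "content": user_text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
        },
    }


request = build_structured_request("gpt-4o-mini", "ACME Ltd, total 120.50 EUR")
print(json.dumps(request["response_format"], indent=2))

# With the official SDK installed and OPENAI_API_KEY set, the call would be roughly:
#   from openai import OpenAI
#   resp = OpenAI().chat.completions.create(**request)
#   data = json.loads(resp.choices[0].message.content)
```

Even in strict mode it is worth checking for a refusal or a response cut off by the token limit before parsing.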
When OpenAI is not the optimal choice:
- Document analysis over 100 pages — 128k context vs 1M in Gemini 1.5 Pro
- Data subject to GDPR data-residency requirements
- Fine-tuning on a specialist domain — more expensive and less flexible than open-source
Anthropic Claude — instruction quality and safety
Claude excels in two areas: precise execution of long, complex system instructions, and consistently the lowest hallucination rate among commercial models on factual data tests.
Where Claude leads:
- Hallucination rate — in TruthfulQA benchmarks and internal RAG tests repeatedly the lowest among commercial models
- Coding — Claude Sonnet is the preferred model among senior engineers for code review and complex multi-file code generation
- Complex system instructions — faithfully follows a 5,000-token system prompt throughout a full multi-turn conversation
- Document analysis — 200k context with full quality (GPT-4o loses precision above ~64k tokens)
- Safety — built-in Constitutional AI refuses dangerous requests without additional guardrailing
When Claude is not the optimal choice:
- Native JSON validation — requires an additional SDK like Instructor or schema prompting
- Cost of simple tasks: at $3.00 vs $0.15 per 1M input tokens, Claude Sonnet costs 20× as much as GPT-4o-mini
- Fewer advanced platform features than OpenAI (Anthropic does offer a Message Batches API for asynchronous jobs)
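When a provider does not enforce a schema natively, the usual workaround is schema prompting plus validation and retry. The sketch below shows that loop using only the standard library. The two-field schema, the retry hint and the flaky stub are all hypothetical; a library such as Instructor wraps the same idea with richer validation.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: key name -> accepted Python type(s).
REQUIRED_KEYS = {"sentiment": str, "confidence": (int, float)}


def parse_or_none(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data


def call_with_retry(model_fn: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model until its reply validates, tightening the prompt after each failure."""
    last_raw = ""
    for _ in range(max_attempts):
        last_raw = model_fn(prompt)
        parsed = parse_or_none(last_raw)
        if parsed is not None:
            return parsed
        prompt += "\nReturn ONLY valid JSON with keys: sentiment (string), confidence (number)."
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_raw!r}")


def make_flaky_stub() -> Callable[[str], str]:
    """Offline stand-in for a model: chatty prose first, clean JSON second."""
    replies = iter([
        'Sure! Here is the JSON: {"sentiment": "positive"}',
        '{"sentiment": "positive", "confidence": 0.9}',
    ])
    return lambda prompt: next(replies)


result = call_with_retry(make_flaky_stub(), "Classify: 'Great product'")
print(result)
```

Every retry is a paid call, which is the hidden cost this weakness refers to.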
Google Gemini — context and multimodality
Gemini has one advantage that no other commercial model beats: a 1 million token context window in Gemini 1.5 Pro. That's the equivalent of 1,500 pages of documents, an entire project codebase, or 20 hours of transcripts — in a single call.
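A rough way to check whether a document fits a given window is to convert pages to tokens using the equivalence above (1M tokens for about 1,500 pages). This is a back-of-the-envelope sketch: the window sizes come from the comparison table, the 4,000-token output reserve is an assumption, and real token counts vary with layout and language, so measure with the provider's tokenizer before relying on it.

```python
import math

# Context windows in tokens, as listed in the comparison table.
CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "Claude Sonnet": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
}

# Derived from the article's own equivalence: 1M tokens is about 1,500 pages.
TOKENS_PER_PAGE = 1_000_000 / 1_500


def models_that_fit(pages: int, reserve_for_output: int = 4_000) -> list[str]:
    """Return the models whose window holds the document plus an output reserve."""
    needed = math.ceil(pages * TOKENS_PER_PAGE) + reserve_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


for pages in (50, 250, 1_000):
    print(pages, "pages ->", models_that_fit(pages))
```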
Where Gemini leads:
- Long context — the only model with a native 1M token window without quality compromises at the retrieval level
- Gemini Flash — $0.075/1M input tokens, the fastest and cheapest "good" model on the market, ideal for classification
- Multimodal — text, image, video and audio in one API call without separate endpoints
- Google Workspace — native connection to Google Drive, Gmail, BigQuery without additional integrations
- Price at large contexts — at 128k+ context Gemini 1.5 Pro is cheaper than GPT-4o at comparable quality
When Gemini is not the optimal choice:
- Native structured output validation (weaker than OpenAI)
- Smaller tool ecosystem — LangChain, DeepEval, Instructor treat Gemini as a second-class citizen
- Applications requiring strict repeatability and deterministic results
Open-source — privacy, control and scale
Open-source is not "cheaper GPT" — it's a different trade-off: full data control, fine-tuning capability on your domain, zero API costs when self-hosting at scale.
Leading models:
- Llama 3.3 70B: benchmark-comparable to GPT-4o-mini on many tasks, open weights, runs on a single 80 GB A100 or 4× RTX 4090 when quantised
- Qwen 2.5 — strong in code and maths, good structured outputs via vLLM with JSON grammar
- Mistral Large/Small — European provider, full GDPR control, strong in European languages
Where open-source leads:
- Data privacy — prompts and data do NOT leave your infrastructure — critical for healthcare, law, finance
- Fine-tuning — full control: LoRA or QLoRA on your own data, without provider permission or cost
- Cost at scale: at sustained high volume, a well-utilised GPU server can cost far less than the API; the break-even depends on GPU price, utilisation and which API model you replace
- Edge and offline deployment — AI without internet: mobile apps, IoT devices, isolated networks
When open-source is not the optimal choice:
- Small teams without DevOps — GPU management costs engineer time, not just money
- Business-critical quality — 70B models are still weaker than GPT-4o on complex multi-step tasks
- Fast prototype — vLLM/Ollama setup takes hours; OpenAI API works in 5 minutes
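Whether self-hosting pays off is mostly arithmetic, so it is worth computing rather than guessing. The sketch below assumes a single cloud A100 at about $3/h (the rate quoted later in this article) running around the clock, compares it only against API input-token prices from the tables, and ignores output tokens, engineer time and whether one GPU can actually serve that volume. Under those assumptions the break-even lands between roughly 720M and 14,400M input tokens per month, depending on which API model you would otherwise use.

```python
HOURS_PER_MONTH = 24 * 30  # always-on GPU, 30-day month (assumption)


def gpu_monthly_cost(hourly_rate_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly rental cost of one GPU."""
    return hourly_rate_usd * hours


def break_even_tokens_millions(gpu_monthly_usd: float, api_price_per_million: float) -> float:
    """Monthly volume, in millions of tokens, at which API spend equals the GPU bill."""
    return gpu_monthly_usd / api_price_per_million


gpu = gpu_monthly_cost(3.00)  # about $3/h for a cloud A100
for name, price in [("GPT-4o-mini", 0.15), ("GPT-4o", 2.50), ("Claude Sonnet", 3.00)]:
    volume = break_even_tokens_millions(gpu, price)
    print(f"{name}: break-even at about {volume:,.0f}M input tokens/month")
```

Renting the GPU only for batch windows, or owning the hardware, moves the break-even considerably, so rerun the numbers with your own rate and utilisation.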
[Figure: LLM provider comparison. Each provider has a different niche and none is best at everything; the choice should follow from the task's requirements, not from popularity.]
Which model for which task?
| Task | Model | Why | Cost/1M input |
|---|---|---|---|
| FAQ chatbot | GPT-4o-mini | Sufficient quality, cheapest in OpenAI | $0.15 |
| Structured JSON with validation | GPT-4o | Native JSONSchema, zero structure errors | $2.50 |
| PDF analysis over 100 pages | Gemini 1.5 Pro | The only model with 1M context | $1.25 |
| Code review and generation | Claude Sonnet | Lowest hallucination, complex instructions | $3.00 |
| Email / document classification | Gemini Flash | Cheapest "sufficient" model | $0.075 |
| Reasoning — maths, planning | o3-mini | Multi-step reasoning specialisation | $1.10 |
| Sensitive data — medical, legal | Llama 3.3 70B self-host | Zero data leakage, full control | ~$0 GPU |
| Domain fine-tuning | Llama or Qwen | Open weights, LoRA on your data | GPU CAPEX |
[Figure: Decision tree, which model? Five questions instead of benchmarks. Answer the first question that matches; that is your starting model to test.]
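The "first matching question wins" idea can be expressed as a short function. The five questions in the original figure are not reproduced in this text, so the questions, their order and the 128k-token threshold below are assumptions derived from the task table above, not the author's exact tree.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical description of a task; field names are illustrative."""
    sensitive_data: bool = False
    context_tokens: int = 0
    strict_json: bool = False
    multi_step_reasoning: bool = False
    code_review: bool = False


def starting_model(req: Requirements) -> str:
    """Return the first model whose question matches; this is a starting point to test."""
    if req.sensitive_data:               # 1. Must data stay on your infrastructure?
        return "Llama 3.3 70B self-host"
    if req.context_tokens > 128_000:     # 2. Does the input exceed a 128k window?
        return "Gemini 1.5 Pro"
    if req.strict_json:                  # 3. Must the output match a JSON schema?
        return "GPT-4o"
    if req.multi_step_reasoning:         # 4. Maths, planning, logical analysis?
        return "o3-mini"
    if req.code_review:                  # 5. Code review or multi-file generation?
        return "Claude Sonnet"
    return "GPT-4o-mini"                 # Default: cheapest sufficient model


print(starting_model(Requirements(context_tokens=400_000)))
print(starting_model(Requirements(strict_json=True, code_review=True)))
print(starting_model(Requirements()))
```

Order matters: a task that needs both strict JSON and code review resolves to the earlier question, so put your hardest constraint first.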
Multi-model strategy — don't choose just one
The best production AI applications don't use a single model. They use a router: a cheap model classifies the task, an expensive model handles complex cases. The architecture costs one extra component but returns a 60–80% cost reduction.
Example split at 1M calls/month:
- 70% of queries → GPT-4o-mini ($0.15/1M) — FAQ, classification, simple generation
- 25% of queries → GPT-4o ($2.50/1M) — structured outputs, complex contexts
- 5% of queries → Claude Sonnet ($3.00/1M) — code review, high-stakes decisions
Total cost: $0.15 × 0.70 + $2.50 × 0.25 + $3.00 × 0.05 = $0.88/1M tokens instead of $3.00 with a single model. 71% saving.
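The same arithmetic as a reusable function, so the split can be re-run with your own traffic mix. It is a sketch under one simplifying assumption: every query consumes the same number of input tokens, so the share of queries equals the share of tokens. The function name `blended_price` is illustrative.

```python
# (model, share of traffic, input price in USD per 1M tokens), from the split above.
SPLIT = [
    ("GPT-4o-mini", 0.70, 0.15),
    ("GPT-4o", 0.25, 2.50),
    ("Claude Sonnet", 0.05, 3.00),
]


def blended_price(split: list[tuple[str, float, float]]) -> float:
    """Traffic-weighted average input price per 1M tokens."""
    total_share = sum(share for _, share, _ in split)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"Traffic shares must sum to 1, got {total_share}")
    return sum(share * price for _, share, price in split)


blended = blended_price(SPLIT)
saving = 1 - blended / 3.00  # baseline: everything on the $3.00 model
print(f"${blended:.2f} per 1M input tokens, {saving:.0%} saving")
# prints: $0.88 per 1M input tokens, 71% saving
```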
```python
# model_router.py
from dataclasses import dataclass
from enum import Enum

from anthropic import Anthropic
from openai import OpenAI


class TaskType(Enum):
    FAST_CHEAP = "fast"
    STRUCTURED_JSON = "structured"
    REASONING = "reasoning"
    CODE_REVIEW = "code"


@dataclass
class ModelConfig:
    provider: str
    model: str
    max_tokens: int


# Model IDs change over time; check each provider's current model list.
ROUTING_TABLE = {
    TaskType.FAST_CHEAP: ModelConfig("openai", "gpt-4o-mini", 4096),
    TaskType.STRUCTURED_JSON: ModelConfig("openai", "gpt-4o", 8192),
    TaskType.REASONING: ModelConfig("openai", "o3-mini", 16384),
    TaskType.CODE_REVIEW: ModelConfig("anthropic", "claude-sonnet-4-6", 8192),
}


def classify_task(user_message: str) -> TaskType:
    """Naive keyword router: the first matching keyword group wins."""
    keywords_code = ["review", "code", "function", "bug", "refactor"]
    keywords_math = ["calculate", "solve", "optimise", "plan", "reason"]
    keywords_json = ["data", "json", "list", "table", "format", "schema"]
    msg = user_message.lower()
    if any(k in msg for k in keywords_code):
        return TaskType.CODE_REVIEW
    if any(k in msg for k in keywords_math):
        return TaskType.REASONING
    if any(k in msg for k in keywords_json):
        return TaskType.STRUCTURED_JSON
    return TaskType.FAST_CHEAP


def route_and_call(user_message: str, system: str = "") -> tuple[str, str]:
    """Route the message to a model and return (answer, model_name)."""
    task = classify_task(user_message)
    config = ROUTING_TABLE[task]

    if config.provider == "openai":
        msgs = [{"role": "user", "content": user_message}]
        if system:
            msgs = [{"role": "system", "content": system}] + msgs
        resp = OpenAI().chat.completions.create(
            model=config.model,
            messages=msgs,
            # o-series reasoning models reject max_tokens, so use the
            # parameter name that works for all three OpenAI models here.
            max_completion_tokens=config.max_tokens,
        )
        return resp.choices[0].message.content, config.model

    if config.provider == "anthropic":
        # Anthropic takes the system prompt as a top-level argument; omit it when empty.
        extra = {"system": system} if system else {}
        resp = Anthropic().messages.create(
            model=config.model,
            messages=[{"role": "user", "content": user_message}],
            max_tokens=config.max_tokens,
            **extra,
        )
        return resp.content[0].text, config.model

    raise ValueError("Unknown provider: " + config.provider)
```
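The keyword classifier can be replaced with GPT-4o-mini acting as the router itself. At $0.15/1M input tokens, a classification prompt of about 100 tokens costs roughly $0.015 per 1,000 classifications, and precision on ambiguous cases is better than with substring keyword matching.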
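A minimal sketch of the LLM-as-router variant follows. The prompt wording and the helper names (`build_router_prompt`, `parse_label`, `classify_with_llm`) are assumptions, and `TaskType` is redefined here only so the block runs on its own. The model call is passed in as a plain function, so a stub can replace the real client in tests.

```python
from enum import Enum
from typing import Callable


class TaskType(Enum):
    FAST_CHEAP = "fast"
    STRUCTURED_JSON = "structured"
    REASONING = "reasoning"
    CODE_REVIEW = "code"


def build_router_prompt(user_message: str) -> str:
    """Ask the cheap model for exactly one label."""
    labels = ", ".join(t.value for t in TaskType)
    return (
        "Classify the request into exactly one label.\n"
        f"Labels: {labels}\n"
        "Reply with the label only.\n\n"
        f"Request: {user_message}"
    )


def parse_label(reply: str) -> TaskType:
    """Map the model's reply to a TaskType; unknown replies take the cheapest route."""
    cleaned = reply.strip().strip(".\"'").lower()
    for task in TaskType:
        if cleaned == task.value:
            return task
    return TaskType.FAST_CHEAP


def classify_with_llm(user_message: str, llm: Callable[[str], str]) -> TaskType:
    return parse_label(llm(build_router_prompt(user_message)))


# Stub in place of a real GPT-4o-mini call, so the sketch runs offline.
task = classify_with_llm("Why does this recursion never terminate?", lambda prompt: "Code.")
print(task)
```

Falling back to the cheapest route on an unparseable reply keeps a router failure from turning into an expensive call.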
Most common mistakes when choosing a model
- Decision based on benchmarks without testing your own data — spend 1 hour evaluating 50 of your own examples before paying for a subscription or switching providers
- Hard-coding the model name — don't write "gpt-4o-2024-11-20" literally in 10 places; store it in a config constant or an environment variable MODEL_NAME
- Ignoring structured outputs: if the output must be JSON, OpenAI's native JSONSchema mode removes most validation code and retry logic (you still need to handle refusals and truncated responses), which is often worth the higher price
- Underestimating latency costs — for a real-time chatbot, p95 latency matters more than token price
- Self-hosting without capacity planning: an A100 costs ~$3/h in the cloud, roughly $2,160/month if it runs around the clock; self-hosting only pays off once your API bill would exceed that, and below that point the API is cheaper and simpler to maintain
- No fallback strategy — what happens when OpenAI has an outage? Design a fallback provider from day one
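A fallback strategy can start as a loop over providers in priority order. The sketch below is provider-agnostic: each provider is just a function from prompt to text, and the two stubs (one simulating an outage) are hypothetical. A production version would typically add timeouts, retries with backoff and logging.

```python
from typing import Callable


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    prompt: str, providers: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try providers in order; return (answer, provider_name) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt), name
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stubs standing in for real SDK calls.
def primary_down(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def secondary_ok(prompt: str) -> str:
    return "answer to: " + prompt


answer, used = call_with_fallback("ping", [("primary", primary_down), ("secondary", secondary_ok)])
print(used, "->", answer)
```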
Checklist before choosing a model
1. Define the task precisely: classification, generation, RAG, structured JSON, reasoning, code review?
2. Test at least 3 models on 30–50 of your own examples, not just general benchmarks
3. Measure p95 latency for your use case, not just overall throughput
4. Calculate monthly cost at your planned volume and compare API vs GPU self-hosting
5. Check GDPR/HIPAA requirements: can prompts and data leave your infrastructure?
6. Design an LLMClient abstraction so you can swap models without refactoring the whole codebase
7. Identify the subset of tasks suitable for routing to a cheaper model
8. Plan quality monitoring, because a provider model update can change results without warning
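One possible shape for the `LLMClient` abstraction is a small protocol that application code depends on, with the model name read from the `MODEL_NAME` environment variable mentioned earlier. This is a sketch under assumptions: the `complete` method, the `EchoClient` class and the `make_client` factory are hypothetical names, and `EchoClient` stands in for real provider-backed clients so the example runs without API keys.

```python
import os
from typing import Protocol


class LLMClient(Protocol):
    """The only interface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoClient:
    """Hypothetical offline client; a real one would wrap a provider SDK."""

    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"


def make_client() -> LLMClient:
    """Read the model name from configuration instead of hard-coding it."""
    model = os.environ.get("MODEL_NAME", "gpt-4o-mini")
    return EchoClient(model)


def summarise(client: LLMClient, text: str) -> str:
    """Application code: knows nothing about providers or model names."""
    return client.complete("Summarise: " + text)


print(summarise(make_client(), "quarterly report"))
```

Swapping providers then means adding another class with a `complete` method and changing the factory, not touching every call site.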
---
I help companies choose and deploy the right LLM — from requirements analysis and benchmarking on your own data to implementing a multi-model router and quality monitoring. Get in touch — I start with a 30-minute analysis of your use case and a recommendation for a starting model.
