Is Ollama ready for production?

Ollama is a developer tool optimised for ease of use, not throughput. It serves only a few requests in parallel (the OLLAMA_NUM_PARALLEL setting; older releases handled one at a time), so with many concurrent users the queue grows and latency spikes. For production with more than 3–5 parallel users, use vLLM or Hugging Face TGI, which have continuous batching and KV-cache memory management. Exception: if your "production" is a single internal tool used by a few people in shifts (not simultaneously), Ollama is sufficient and simpler.

How much does running your own LLM cost per month?

Depends on GPU and load. Owned RTX 4090: electricity ~$25/mo + GPU amortisation ~$130/mo (over 36 months) = ~$155/mo. A10G on-demand (RunPod), 8 h per working day: ~$150/mo. A100 40 GB (RunPod), 24/7: ~$1,800/mo. Add DevOps time (~4 h/mo for monitoring and updates). Break-even vs GPT-4o-mini: at roughly $0.15 per 1M tokens, 30M tokens/mo costs about $4.50 through the API and 300M tokens/mo about $45, both well below the ~$155/mo of an owned RTX 4090. On price alone the lines cross only at around 1B tokens/mo, and much earlier against more expensive models. Key: factor in the engineer's time; self-hosting is not "zero operating cost."

Does an open-source LLM match GPT-4o quality?

On narrow tasks after fine-tuning — often yes. Llama 3.1 70B Instruct achieves results close to GPT-4o-mini on standard benchmarks. Qwen2.5 72B regularly beats GPT-4o on coding tasks. Where open-source still lags: complex multi-step reasoning, very long context (128k+), and tasks requiring broad current knowledge. The key rule: always evaluate the model on YOUR data and YOUR task — general benchmarks don't replace a test on your own golden dataset.
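A golden-dataset check can be as small as the sketch below. It is only an illustration: the dataset entries, the substring-match scoring and the names `GOLDEN`, `pass_rate` and `fake_model` are assumptions, and a real setup would plug an actual model call in place of the stand-in.

```python
from typing import Callable

# Hypothetical golden dataset: prompts paired with a substring the answer must contain.
GOLDEN = [
    {"prompt": "What is the capital of Poland?", "expected": "Warsaw"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]


def pass_rate(generate: Callable[[str], str], golden: list[dict]) -> float:
    """Share of golden examples whose answer contains the expected substring."""
    hits = sum(
        1 for case in golden
        if case["expected"].lower() in generate(case["prompt"]).lower()
    )
    return hits / len(golden)


def fake_model(prompt: str) -> str:
    """Stand-in for a real client call (for example a chat completion request)."""
    return "Warsaw is the capital." if "capital" in prompt else "I am not sure."


print(f"pass rate: {pass_rate(fake_model, GOLDEN):.0%}")
```

Running the same function against two models (or two quantisation levels) gives a like-for-like number on your own task, which is the comparison a general benchmark cannot provide.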

How do I migrate an existing app from the OpenAI API to vLLM?

In the simplest case you change two values in the OpenAI client config: base_url and the model name. The rest (API calls, streaming and most structured-output usage) carries over because vLLM implements the same REST interface. Common post-migration issues: (1) Different chat template: vLLM formats messages with the model's own template, so prompts tuned for GPT can behave differently; pass the system prompt explicitly as a system message and retest. (2) Slightly different model behaviour: run your golden dataset and compare metrics. (3) Smaller context window (some models offer 8k vs 128k for GPT-4o): adjust RAG chunking.
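To make the "only the endpoint and the model name change" point concrete, here is a standard-library sketch that builds the same chat request for both backends without sending it. The helper name `build_chat_request` and the placeholder API key are assumptions for illustration; in a real app the OpenAI SDK does this work for you.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "not-needed") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style /chat/completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


# Before: hosted API. After: local vLLM. Only base_url and model differ.
hosted = build_chat_request("https://api.openai.com/v1", "gpt-4o-mini", "Hello!")
local = build_chat_request("http://localhost:8000/v1",
                           "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(hosted.full_url)
print(local.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; the payload shape is identical in both cases.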

Can I use a QLoRA fine-tuned model with Ollama or vLLM?

Yes. After QLoRA training you have LoRA adapter weights. Two approaches: (1) Merge the adapter into the base model (in Unsloth, via model.save_pretrained_merged), convert the result to a GGUF file, then run ollama create my-model -f Modelfile; (2) run vLLM with --enable-lora and --lora-modules my-model=/path/to/adapter, which lets you serve multiple LoRA variants on one base model without merging. Ollama is simpler; vLLM with LoRA is more flexible when you have several model variants.
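For the Ollama route, the Modelfile is a short text file whose FROM line points at the GGUF file. The sketch below writes a minimal one; the helper name `write_modelfile`, the file name `my-model.gguf` and the temperature value are illustrative assumptions, not part of any tool.

```python
import tempfile
from pathlib import Path


def write_modelfile(gguf_path: str, target_dir: str, temperature: float = 0.2) -> Path:
    """Write a minimal Ollama Modelfile that points at a merged GGUF file."""
    modelfile = Path(target_dir) / "Modelfile"
    modelfile.write_text(
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n",
        encoding="utf-8",
    )
    return modelfile


# Demo in a throwaway directory; a real project would write next to the GGUF file.
with tempfile.TemporaryDirectory() as tmp:
    path = write_modelfile("./my-model.gguf", tmp)
    print(path.read_text(encoding="utf-8"))
```

With the file in place, the `ollama create my-model -f Modelfile` command from the answer above registers the model under that name.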

What are the alternatives to Ollama and vLLM?

Hugging Face TGI (Text Generation Inference) — enterprise-grade, supports tensor parallelism and quantisation; good for Mistral and HF models. LM Studio — GUI without a terminal, good for non-technical evaluation. llama.cpp — the raw C++ backend used by Ollama under the hood; maximum quantisation control. Llamafile — a single executable containing both model and server; ideal portability. Jan.ai — open-source LM Studio alternative. For enterprise at large scale: NVIDIA Triton Inference Server with TensorRT-LLM gives the highest throughput on NVIDIA GPUs, but requires specialist DevOps knowledge.

Does 4-bit quantisation reduce model quality?

Slightly — typically 1–3% on benchmarks for Q4_K_M, practically unmeasurable for Q8. In practice a different relationship matters more: a larger quantised model almost always beats a smaller full-precision model at the same VRAM. Llama 70B Q4 (~40 GB) is clearly better than Llama 8B bf16 (~16 GB). That's why Q4 is the self-hosted standard: you maximise the model size that fits on your GPU. Exception: tasks sensitive to numeric precision (amount extraction, calculations) — test Q8 vs Q4 on your own golden dataset there, because quantisation degradation is uneven across task types.


2026-06-12 · AI & Automation · 14 min

Ollama, vLLM and Self-Hosted LLM — How to Run a Local AI Model in Your Company

Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

Self-hosting a language model pays off when your monthly API cost exceeds ~$400–600, or when data cannot leave your infrastructure (GDPR, trade secrets, medical data). Below that threshold the OpenAI or Anthropic API is cheaper once you factor in DevOps and GPU costs. Above it — an RTX 4090 (~$1,600) pays for itself in 3–5 months at 1M+ tokens per day. For getting started on a laptop and testing: Ollama. For production with required scale and low latency: vLLM. For evaluating models without a terminal: LM Studio.

A practical guide to self-hosting language models: when your own model pays off more than the OpenAI API, how to configure Ollama on a laptop and vLLM on a GPU server, which GPU to choose, how to connect applications through the OpenAI-compatible API, and how to deploy monitoring for a private LLM in production.

Your AI chatbot handles customer queries and processes internal documents — and every call sends fragments of that data to OpenAI's servers. Your lawyer asks whether this is GDPR-compliant. Finance looks at the invoice: $3,200 a month for the API. The CTO wonders whether it's time for your own infrastructure.

More and more companies are asking this question — and the answer is: it depends, but increasingly "yes." The self-hosted LLM ecosystem (Ollama, vLLM, LM Studio, Hugging Face TGI) has matured enough that deployment takes hours, not weeks. This article shows when, how and on what.

Criterion	API (OpenAI/Anthropic)	Self-hosted (vLLM)
Upfront cost	$0	$300–$6,000 GPU
Cost per token	$0.15–$15 / 1M tok	~$0 (infrastructure)
Pays off when	API spend < $400/mo	API spend > $400–600/mo
Data privacy	Data sent to the cloud	100% in your infrastructure
Latency	50–500 ms (network)	5–200 ms (local)
Models	GPT-4o, Claude, Gemini	Llama 3, Mistral, Phi-3...
DevOps cost	Zero	High (GPU, monitoring)
Knowledge cutoff	Current	Model training cutoff
Best for	Prototype, startup, small scale	Scale, compliance, regulation

When does a self-hosted LLM pay off?

/// OLLAMA vs vLLM vs LM STUDIO — WHICH FOR WHAT?

Ollama (laptop / dev)
Entry barrier: ⭐ very low
GPU required: optional (CPU ok)
API: OpenAI-compatible
Models: Llama, Mistral, Phi...
Scalability: ✗ a few parallel requests at most
Best for: dev, prototypes, testing

vLLM (production)
Entry barrier: ⭐⭐⭐ higher
GPU required: yes (CUDA)
API: OpenAI-compatible
PagedAttention: ✓ 2–4× more req/s
Scalability: ✓ multi-GPU, batching
Best for: production, high scale

LM Studio (GUI / desktop)
Entry barrier: ⭐ no technical knowledge
GPU required: optional
API: local OpenAI server
Models: GGUF from Hugging Face
Scalability: ✗ desktop only
Best for: model testing, business

100% of data on-premises · <200 ms latency (vLLM + A100)

Four scenarios where your own infrastructure beats the API:

Compliance and privacy — regulated industries (healthcare, law, finance, defence) cannot send data to external providers; HIPAA, GDPR Art. 28, banking secrecy directly prohibit it or require a DPA that OpenAI won't sign with every customer
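High token volume: at 5M+ tokens per day (about $22.50/day at a blended API price of $4.50 per 1M tokens, well above GPT-4o-mini's input price) a dedicated GPU can pay for itself within months; at 20M tokens per day the API bill at that rate is roughly $33,000 a year, most of which self-hosting can save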
Low latency as a hard requirement — real-time applications (voice AI, interactive UI) need <200 ms; local vLLM with the GPU on the same machine as the app delivers 30–80 ms
Fine-tuned model — after QLoRA training (article #38) you have a GGUF adapter file or LoRA weights; to use them you need a local inference server — Ollama or vLLM
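The payback reasoning behind the high-volume scenario can be sketched in a few lines. The inputs below are hypothetical round numbers, not measurements, and the function name `payback_months` is an assumption for illustration.

```python
def payback_months(hardware_usd: float, api_monthly_usd: float,
                   running_monthly_usd: float) -> float:
    """Months until the hardware purchase is covered by the monthly saving."""
    saving = api_monthly_usd - running_monthly_usd
    if saving <= 0:
        return float("inf")  # self-hosting never pays back at this volume
    return hardware_usd / saving


# Hypothetical inputs: $1,600 GPU, $500/month API bill, $100/month power and upkeep.
print(payback_months(1600, 500, 100))
```

The guard for a non-positive saving matters in practice: below a certain API bill the running costs alone exceed what you would save, so no payback period exists.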

Three scenarios where the API is better:

Startup in the product-building phase — prototype in 30 minutes, no DevOps
Small scale (< $200/mo API) — GPU and engineer costs exceed the savings
You need GPT-4o / Claude — these are not available self-hosted (closed weights)

Ollama — local LLM in 5 minutes

Ollama is the easiest way to run an open model on your own machine. Installation, downloading a model and the first call are literally three commands.

setup_ollama.sh

    # macOS / Linux: install with one command
    curl -fsSL https://ollama.ai/install.sh | sh

    # Download and run a model (pulled automatically from the Ollama model library)
    ollama run llama3.2:3b           # 3B, runs on any laptop
    ollama run llama3.1:8b           # 8B, needs 8 GB VRAM or 16 GB RAM
    ollama run mistral:7b-instruct   # 7B Mistral, great for instructions
    ollama run phi3:14b              # 14B Phi-3, best quality/VRAM ratio

    # HTTP server on port 11434 (starts automatically)
    # OpenAI-compatible REST API:
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Hello!"}]}'

Connecting an existing OpenAI SDK app to Ollama:

app_with_ollama.py

    from openai import OpenAI

    # Only change base_url; the rest of the code is identical to OpenAI
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": "Summarise this contract in 3 sentences."}],
        temperature=0.2,
    )
    print(response.choices[0].message.content)

Configuring Ollama as a systemd service (Linux/server):

ollama.service

    [Unit]
    Description=Ollama LLM server
    After=network.target

    [Service]
    ExecStart=/usr/local/bin/ollama serve
    Restart=always
    User=ollama
    Environment=OLLAMA_HOST=0.0.0.0:11434
    Environment=OLLAMA_MODELS=/opt/models

    [Install]
    WantedBy=multi-user.target

When Ollama is enough — and when it isn't:

Ollama works well for: local development, model testing before choosing, small deployments (1–5 concurrent users), integrating with n8n or LangChain while building a pipeline
Ollama is not suitable for: production with many parallel requests (its parallelism is limited to a few slots, so further calls queue), high throughput requirements, deployments needing full GPU utilisation
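A rough way to see why a server with few parallel slots struggles under load: if requests are served in waves, the last user waits for every wave before theirs. The sketch below is a deliberate simplification (fixed time per request, no batching effects), and the function name and numbers are assumptions for illustration.

```python
import math


def worst_case_wait_s(simultaneous_requests: int, parallel_slots: int,
                      seconds_per_request: float) -> float:
    """Time until the last request finishes when requests are served in waves."""
    waves = math.ceil(simultaneous_requests / parallel_slots)
    return waves * seconds_per_request


# Hypothetical load: 10 users hit the server at once, each answer takes 4 s.
print(worst_case_wait_s(10, parallel_slots=1, seconds_per_request=4.0))
print(worst_case_wait_s(10, parallel_slots=4, seconds_per_request=4.0))
```

With one slot the tenth user waits ten times as long as the first, which is why a handful of users in shifts is fine and a busy team is not.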

vLLM — production-ready LLM inference

vLLM is a GPU-optimised inference engine. Its key innovation — PagedAttention — manages KV-cache memory like OS paging, allowing 2–4× more concurrent requests on the same GPU compared to a naive implementation.
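The memory effect of paging can be estimated with simple arithmetic. The sketch below assumes the published shape of Llama 3.1 8B (32 layers, 8 KV heads, head size 128, bf16 values) and a block size of 16 tokens; the function names are illustrative, and the "naive" baseline simply reserves the full context window per request.

```python
import math


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Keys and values (factor 2) stored for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def blocks_needed(tokens: int, block_size: int = 16) -> int:
    """Fixed-size blocks a sequence occupies when the cache is paged."""
    return math.ceil(tokens / block_size)


# Assumed model shape: 32 layers, 8 KV heads, head size 128, 2 bytes per value.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

max_len, actual_len, block_size = 8192, 300, 16
reserved_contiguous = max_len * per_token  # naive: reserve the full window
reserved_paged = blocks_needed(actual_len, block_size) * block_size * per_token

print(f"naive: {reserved_contiguous / 2**20:.0f} MiB, paged: {reserved_paged / 2**20:.0f} MiB")
```

For a 300-token request the paged layout holds only the blocks actually in use, so the memory that a naive server would keep idle is available to other requests.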

setup_vllm.sh

    # Installation (requires CUDA 12.x, Python 3.10+)
    pip install vllm

    # Start server (OpenAI-compatible API on port 8000)
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --gpu-memory-utilization 0.90 \
      --max-model-len 8192 \
      --dtype bfloat16 \
      --port 8000

    # Fine-tuned model with LoRA adapter:
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --enable-lora \
      --lora-modules company-model=/opt/lora/checkpoint-500 \
      --port 8000

Connecting your app to vLLM — identical code to OpenAI:

app_with_vllm.py

    import asyncio

    from openai import AsyncOpenAI

    # The async client lets many requests be in flight at once
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

    MODEL = "meta-llama/Llama-3.1-8B-Instruct"


    async def summarise(doc: str) -> str:
        response = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": f"Summarise: {doc}"}],
            max_tokens=200,
        )
        return response.choices[0].message.content


    # Batch processing: vLLM batches the parallel requests on the GPU
    async def process_documents(docs: list[str]) -> list[str]:
        return await asyncio.gather(*(summarise(doc) for doc in docs))


    if __name__ == "__main__":
        print(asyncio.run(process_documents(["First document", "Second document"])))

Configuration for different GPU setups:

Single A100 80 GB: "--gpu-memory-utilization 0.90 --tensor-parallel-size 1"
Two A100 40 GB cards (70B model): "--tensor-parallel-size 2"
Four RTX 4090 24 GB cards (34B model): "--tensor-parallel-size 4"

Which GPU to choose and what does it cost?

/// GPU: VRAM vs MODEL

Which GPU for which model?

VRAM required for inference (full precision or 4-bit quantisation)

GPU	VRAM	Fits
RTX 3060 12 GB	12 GB	7B Q4, 3B bf16
RTX 4090 24 GB	24 GB	7B bf16, 13B Q4
A10G 24 GB (cloud)	24 GB	7B bf16, QLoRA 13B
A100 40 GB (cloud)	40 GB	13B bf16, 34B Q4
A100 80 GB (cloud)	80 GB	70B Q4, 34B bf16
H100 80 GB (cloud)	80 GB	70B Q4; 70B bf16 (~140 GB) needs 2× 80 GB

4-bit: quantisation to start with · 4× less VRAM (GGUF Q4 vs bf16) · 24 GB: sweet spot for business

Practical rule: you need ~2× the VRAM the model occupies in your chosen precision.

What is quantisation (Q4, Q8, GGUF)?

Quantisation compresses model weights from full precision (16-bit, written bf16) to fewer bits — usually 8 (Q8) or 4 (Q4). It's the key technique that lets large models fit on consumer GPUs:

bf16 (full precision) — an 8B model takes ~16 GB VRAM; best quality, needs an expensive GPU
Q8 (8-bit) — ~8 GB for an 8B model; quality loss practically unmeasurable; the safe standard
Q4 (4-bit) — ~5 GB for an 8B model; 1–3% quality loss on benchmarks; the most common self-hosted choice, because it fits a 2× larger model on the same GPU — and a larger Q4 model almost always beats a smaller bf16 one
GGUF — the quantised-model file format used by llama.cpp and Ollama; download ready-made GGUF models from Hugging Face (look for the "-GGUF" suffix)

Practical rule: start with Q4. If you notice quality issues on your task — switch to Q8 or bf16 and compare on your golden dataset.
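The sizes quoted above follow from parameters × bits per weight. The sketch below reproduces them; the 4.8 bits per weight used for Q4 is an assumption (practical 4-bit formats store scaling data on top of the raw 4 bits), and the 2× headroom factor is the practical rule stated earlier in this article. Function names are illustrative.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def recommended_vram_gb(model_gb: float, headroom: float = 2.0) -> float:
    """Apply the rule of thumb: VRAM of about 2x the model size."""
    return model_gb * headroom


for label, bits in [("bf16", 16), ("Q8", 8), ("Q4 (assumed ~4.8 bits)", 4.8)]:
    size = weights_gb(8, bits)
    print(f"8B {label}: ~{size:.1f} GB weights, "
          f"~{recommended_vram_gb(size):.0f} GB VRAM with 2x headroom")
```

The same function applied to a 70B model at the assumed 4.8 bits gives about 42 GB, in line with the ~40 GB quoted for Llama 70B Q4.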

GPU	VRAM	Cost (purchase or hourly)	Max model	Throughput (~tok/s)
RTX 3060	12 GB	~$300	7B Q4	25–35
RTX 4090	24 GB	~$1,600	13B Q4 or 7B bf16	60–90
A10G (cloud)	24 GB	~$1.5/h	13B Q4 or 7B bf16	80–120
A100 40 GB (cloud)	40 GB	~$2.5/h	34B Q4 or 13B bf16	150–200
A100 80 GB (cloud)	80 GB	~$3.5/h	70B Q4 or 34B bf16	120–180
H100 80 GB (cloud)	80 GB	~$4–8/h	70B Q4 (fast)	300–500

Cloud vs own GPU: For variable workloads (business hours) — cloud on-demand (RunPod, Lambda Labs, Vast.ai) is cheaper: you pay only for uptime. For constant 24/7 load — a dedicated machine pays for itself in 3–6 months. Calculator: monthly cloud cost = hourly rate × 720 h; if > $800 — your own server starts to win.
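The calculator rule above fits in two small functions. This is a sketch using the hourly rates from the table; the function names and the 21 working days per month in the second example are assumptions.

```python
HOURS_PER_MONTH = 720  # 24 h x 30 days, as in the rule above


def monthly_cloud_cost(hourly_rate_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cloud bill for the given number of billed hours."""
    return hourly_rate_usd * hours


def own_server_wins(hourly_rate_usd: float, threshold_usd: float = 800.0) -> bool:
    """True when the 24/7 cloud bill exceeds the threshold from the rule above."""
    return monthly_cloud_cost(hourly_rate_usd) > threshold_usd


# A100 40 GB at ~$2.5/h, running around the clock.
print(monthly_cloud_cost(2.5), own_server_wins(2.5))

# A10G at ~$1.5/h, business hours only (assumed 8 h x 21 working days).
print(monthly_cloud_cost(1.5, hours=8 * 21))
```

The second call shows why variable workloads favour on-demand cloud: billing only for business hours keeps the same card far below the threshold.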

Which models to choose for self-hosting?

Model	Size	VRAM Q4	Strengths	Weaknesses
Llama 3.1 8B Instruct	8B	~5 GB	Best quality/VRAM, decent multilingual	Weaker reasoning than 70B
Llama 3.1 70B Instruct	70B	~40 GB	Close to GPT-4o-mini	Needs A100
Mistral 7B v0.3	7B	~4.5 GB	Fast, coding, EN	Weaker non-English
Phi-3 Medium 14B	14B	~8 GB	Great reasoning, small footprint	Short context (4k)
Qwen2.5 7B Instruct	7B	~5 GB	Multilingual, coding	Distinctive style
Qwen2.5 72B Instruct	72B	~42 GB	Best open-source 2025	High GPU demands
Gemma 2 9B Instruct	9B	~5.5 GB	Google quality, safe	Smaller ecosystem

Starting recommendation: begin with Llama 3.1 8B Instruct (Ollama: "ollama pull llama3.1:8b") — runs on an RTX 4090 or A10G, handles multiple languages, excellent quality-to-cost ratio. For 70B+ requirements — Qwen2.5 72B or Llama 3.1 70B on an A100 80 GB.

How to connect a self-hosted LLM to existing tools?

Both Ollama and vLLM expose an OpenAI-compatible REST API — meaning any tool that supports OpenAI works immediately, without modification:

LangChain — ChatOpenAI(base_url="http://localhost:11434/v1", model="llama3.1:8b") — one line change
LlamaIndex — OpenAI(api_base="http://localhost:8000/v1") — same pattern
n8n — in the OpenAI node, change "Base URL" to your local server address
Open WebUI — a browser-based GUI for Ollama, works like ChatGPT, available at localhost:3000 after install
Continue.dev — VS Code plugin with a local LLM instead of Copilot; set model and apiBase in config.json

Monitoring and security for self-hosted LLM

A self-hosted LLM is not just a GPU and a model — you need observability and access control, just like with an external API.

Observability:

Langfuse self-hosted (Docker Compose) — traces, tokens, latency, compute costs; works the same as with an external API
Prometheus + Grafana — GPU metrics (temperature, VRAM usage, throughput) via nvidia-smi or DCGM exporter
Your own JSON log with request_id, model, prompt_tokens, completion_tokens, latency_ms, user_id for every call
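A JSON log line with the fields listed above could look like the sketch below. The helper name `log_record` and the sample values are assumptions; in a real service the token counts would come from the usage section of the model response.

```python
import json
import time
import uuid


def log_record(model: str, prompt_tokens: int, completion_tokens: int,
               started_at: float, user_id: str) -> str:
    """Serialise one call as a single JSON line."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": round((time.monotonic() - started_at) * 1000),
        "user_id": user_id,
    })


start = time.monotonic()
# ... the model call would happen here ...
line = log_record("llama3.1:8b", 42, 128, start, "user-17")
print(line)
```

One JSON object per line keeps the log easy to ship to any aggregator and to query later for cost or latency per user.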

Access control:

vLLM supports API key auth via "--api-key SECRET" — enforce this from day one
Reverse proxy (nginx or Caddy) with SSL + rate limiting in front of vLLM; never expose the vLLM port directly to the internet
Log every call with IP and user_id for auditing — GDPR compliance requires this

nginx_vllm_config.conf

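    # limit_req_zone belongs in the http context (for example a file in conf.d/),
    # declared before the server block that uses it
    limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=30r/m;

    server {
        listen 443 ssl;
        server_name llm.yourcompany.com;

        ssl_certificate /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;

        location /v1/ {
            proxy_pass http://127.0.0.1:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_buffering off;       # pass streamed tokens through as they arrive
            proxy_read_timeout 300s;   # long generations can exceed the 60 s default
            limit_req zone=llm_limit burst=20 nodelay;
        }
    }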
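What rate=30r/m with burst=20 nodelay means in practice is easier to see in a small simulation. The class below is a simplified model of leaky-bucket accounting, not nginx's actual code; the class name and the timestamps are assumptions for illustration.

```python
class LeakyBucket:
    """Simplified model of the accounting behind limit_req with burst and nodelay."""

    def __init__(self, rate_per_second: float, burst: int) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.excess = 0.0   # requests currently above the steady rate
        self.last = None    # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:
            self.last = now
            return True
        # Excess drains at the configured rate and grows by one per request.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1.0)
        if excess > self.burst:
            return False  # nginx would reject this request (503 by default)
        self.excess = excess
        self.last = now
        return True


# rate=30r/m is 0.5 requests per second; burst=20 as in the config above.
bucket = LeakyBucket(rate_per_second=30 / 60, burst=20)
accepted_at_once = sum(bucket.allow(now=0.0) for _ in range(25))
print(accepted_at_once)
```

In this model a client can fire a short burst immediately, after which it is held to one request every two seconds until the excess drains.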

Deployment checklist

1. Calculate TCO: API cost per month vs GPU cost + DevOps cost; if API < $400/mo, don't deploy your own model
2. Define compliance requirements: does the data really have to stay on-premises, or would a DPA with an API provider suffice?
3. Start with Ollama locally: verify the chosen model gives adequate quality on your data BEFORE buying a GPU
4. Choose GPU by the rule: VRAM ≥ 2× model size (Q4); for production with 5+ concurrent users, A10G or A100
5. For production: vLLM instead of Ollama; it handles parallel requests and uses PagedAttention
6. Expose the API through nginx with SSL, API key auth and rate limiting; never directly to the internet
7. Deploy GPU monitoring (Prometheus + Grafana) and LLM traces (Langfuse self-hosted)
8. Run a regression test on your golden dataset after every model change
9. Plan rollback: keep the previous model version; switching should take < 5 minutes
10. Review your GDPR policy: even self-hosted deployments require a processing register for personal data the LLM handles

