Ollama, vLLM and Self-Hosted LLM — How to Run a Local AI Model in Your Company
Self-hosting a language model pays off when your monthly API cost exceeds ~$400–600, or when data cannot leave your infrastructure (GDPR, trade secrets, medical data). Below that threshold the OpenAI or Anthropic API is cheaper once you factor in DevOps and GPU costs. Above it — an RTX 4090 (~$1,600) pays for itself in 3–5 months at 1M+ tokens per day. For getting started on a laptop and testing: Ollama. For production with required scale and low latency: vLLM. For evaluating models without a terminal: LM Studio.
A practical guide to self-hosting language models: when your own model pays off more than the OpenAI API, how to configure Ollama on a laptop and vLLM on a GPU server, which GPU to choose, how to connect applications through the OpenAI-compatible API, and how to deploy monitoring for a private LLM in production.
Your AI chatbot handles customer queries and processes internal documents — and every call sends fragments of that data to OpenAI's servers. Your lawyer asks whether this is GDPR-compliant. Finance looks at the invoice: $3,200 a month for the API. The CTO wonders whether it's time for your own infrastructure.
More and more companies are asking this question — and the answer is: it depends, but increasingly "yes." The self-hosted LLM ecosystem (Ollama, vLLM, LM Studio, Hugging Face TGI) has matured enough that deployment takes hours, not weeks. This article shows when, how and on what.
| Criterion | API (OpenAI/Anthropic) | Self-hosted (vLLM) |
|---|---|---|
| Upfront cost | $0 | $300–$6,000 GPU |
| Cost per token | $0.15–$15 / 1M tok | ~$0 (infrastructure) |
| Break-even | Cheaper below ~$400/mo API spend | Cheaper above ~$400–600/mo API spend |
| Data privacy | Data sent to the cloud | 100% in your infrastructure |
| Latency | 50–500 ms (network) | 5–200 ms (local) |
| Models | GPT-4o, Claude, Gemini | Llama 3, Mistral, Phi-3... |
| DevOps cost | Zero | High (GPU, monitoring) |
| Knowledge cutoff | Refreshed by the provider with each model release | Fixed at the model's training cutoff |
| Best for | Prototype, startup, small scale | Scale, compliance, regulation |
When does a self-hosted LLM pay off?
Four scenarios where your own infrastructure beats the API:
- Compliance and privacy: regulated industries (healthcare, law, finance, defence) often cannot send data to external providers; HIPAA, GDPR Art. 28 and banking-secrecy rules either restrict such transfers or require a data processing agreement (DPA) on terms that an API provider may not offer every customer
- High token volume: at 5M+ tokens per day (about $22.50/day at a blended ~$4.50 per 1M tokens, which is roughly GPT-4o rather than GPT-4o-mini pricing) your own A100 40 GB pays for itself in 2–3 months; at 20M tokens per day the saving is $50,000 a year
- Low latency as a hard requirement — real-time applications (voice AI, interactive UI) need <200 ms; local vLLM with the GPU on the same machine as the app delivers 30–80 ms
- Fine-tuned model — after QLoRA training (article #38) you have a GGUF adapter file or LoRA weights; to use them you need a local inference server — Ollama or vLLM
Three scenarios where the API is better:
- Startup in the product-building phase — prototype in 30 minutes, no DevOps
- Small scale (< $200/mo API) — GPU and engineer costs exceed the savings
- You need GPT-4o / Claude — these are not available self-hosted (closed weights)
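The break-even reasoning behind these scenarios can be sketched as a small calculation. This is a rough model, not a full TCO analysis: the function name is hypothetical, and the monthly operations cost (power, hosting, engineer time) is an assumed figure you would replace with your own. Only the ~$1,600 GPU price and the $3,200 API invoice come from the article.

```python
import math


def break_even_months(hardware_cost: float,
                      monthly_api_cost: float,
                      monthly_ops_cost: float) -> float:
    """Months until the hardware is paid off by avoided API spend.

    Returns math.inf when self-hosting saves nothing per month.
    """
    monthly_saving = monthly_api_cost - monthly_ops_cost
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost / monthly_saving


# $3,200/month API invoice, RTX 4090 at ~$1,600,
# and an ASSUMED $800/month for power, hosting and maintenance.
print(break_even_months(1600, 3200, 800))   # about 0.67 months
# A $300/month API bill never pays back under the same assumption.
print(break_even_months(1600, 300, 800))    # inf
```

The second call shows why small-scale usage stays on the API: once operating costs exceed the API bill, no hardware price is low enough to break even.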
Ollama — local LLM in 5 minutes
Ollama is the easiest way to run an open model on your own machine. Installation, downloading a model and the first call are literally three commands.
    # Linux: install with one command (on macOS, install the Ollama desktop app instead)
    curl -fsSL https://ollama.ai/install.sh | sh

    # Download and run a model (pulled automatically from the Ollama model library)
    ollama run llama3.2:3b            # 3B: runs on any laptop
    ollama run llama3.1:8b            # 8B: needs 8 GB VRAM or 16 GB RAM
    ollama run mistral:7b-instruct    # 7B Mistral: great for instructions
    ollama run phi3:14b               # 14B Phi-3: best quality/VRAM ratio

    # HTTP server on port 11434 (starts automatically)
    # OpenAI-compatible REST API:
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Hello!"}]}'
Connecting an existing OpenAI SDK app to Ollama:
    from openai import OpenAI

    # Only change base_url; the rest of the code is identical to OpenAI
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": "Summarise this contract in 3 sentences."}],
        temperature=0.2,
    )
    print(response.choices[0].message.content)
Configuring Ollama as a systemd service (Linux/server):
    [Unit]
    Description=Ollama LLM server
    After=network.target

    [Service]
    ExecStart=/usr/local/bin/ollama serve
    Restart=always
    User=ollama
    Environment=OLLAMA_HOST=0.0.0.0:11434
    Environment=OLLAMA_MODELS=/opt/models

    [Install]
    WantedBy=multi-user.target
When Ollama is enough — and when it isn't:
- Ollama works well for: local development, model testing before choosing, small deployments (1–5 concurrent users), integrating with n8n or LangChain while building a pipeline
- Ollama is not suitable for: production with many parallel requests (parallelism is limited and set via OLLAMA_NUM_PARALLEL; older releases served one call at a time), high throughput requirements, deployments needing full GPU utilisation
vLLM — production-ready LLM inference
vLLM is a GPU-optimised inference engine. Its key innovation — PagedAttention — manages KV-cache memory like OS paging, allowing 2–4× more concurrent requests on the same GPU compared to a naive implementation.
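To make the paging analogy concrete, here is a toy sketch of block-based KV-cache allocation. It is not vLLM's implementation: the class and method names are hypothetical, and real PagedAttention also handles GPU kernels, block sharing and preemption. The point it illustrates is only that memory is handed out in small fixed-size blocks as a request grows, rather than reserved up front for the maximum sequence length.

```python
class PagedKVCache:
    """Toy block allocator: KV-cache memory is handed out in fixed-size
    blocks on demand instead of reserving max_model_len per request."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size                   # tokens per block
        self.free_blocks = list(range(total_blocks))   # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # request -> its blocks
        self.lengths: dict[str, int] = {}              # request -> tokens stored

    def append_token(self, request_id: str) -> bool:
        """Reserve room for one more token; False if the cache is full."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:     # last block is full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return True

    def release(self, request_id: str) -> None:
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(total_blocks=8, block_size=16)
for _ in range(40):          # a 40-token request needs ceil(40 / 16) = 3 blocks
    cache.append_token("req-a")
print(len(cache.block_tables["req-a"]), len(cache.free_blocks))  # 3 5
```

Because a short request only ever holds the blocks it has filled, the remaining blocks stay available for other requests, which is where the extra concurrency comes from.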
    # Installation (requires CUDA 12.x, Python 3.10+)
    pip install vllm

    # Start server (OpenAI-compatible API on port 8000)
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --gpu-memory-utilization 0.90 \
      --max-model-len 8192 \
      --dtype bfloat16 \
      --port 8000

    # Fine-tuned model with LoRA adapter:
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --enable-lora \
      --lora-modules company-model=/opt/lora/checkpoint-500 \
      --port 8000
Connecting your app to vLLM — identical code to OpenAI:
    import asyncio

    import httpx
    from openai import OpenAI

    # Synchronous client: same SDK as for OpenAI, only base_url changes
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

    # Batch processing: vLLM handles many parallel requests
    async def process_documents(docs: list[str]) -> list[str]:
        # Generation can take longer than httpx's default 5 s timeout
        async with httpx.AsyncClient(timeout=120.0) as http:
            tasks = [
                http.post(
                    "http://localhost:8000/v1/chat/completions",
                    json={
                        "model": "meta-llama/Llama-3.1-8B-Instruct",
                        "messages": [{"role": "user", "content": f"Summarise: {doc}"}],
                        "max_tokens": 200,
                    },
                )
                for doc in docs
            ]
            results = await asyncio.gather(*tasks)
        for r in results:
            r.raise_for_status()  # surface HTTP errors instead of a KeyError below
        return [r.json()["choices"][0]["message"]["content"] for r in results]
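When the document list is large, firing every request at once can overload the client or the server queue. A minimal sketch of bounding the number of in-flight requests with a semaphore follows. The names `map_bounded` and `fake_summarise` are hypothetical, and the stub stands in for the real HTTP call so the example runs without a server.

```python
import asyncio
from typing import Awaitable, Callable


async def map_bounded(items: list[str],
                      worker: Callable[[str], Awaitable[str]],
                      max_in_flight: int = 16) -> list[str]:
    """Run worker over items with at most max_in_flight calls at once.

    Results come back in the same order as items.
    """
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_summarise(doc: str) -> str:
    """Stand-in for the HTTP call to the inference server."""
    await asyncio.sleep(0.01)
    return doc[:20]


summaries = asyncio.run(map_bounded(["first document", "second document"],
                                    fake_summarise, max_in_flight=1))
print(summaries)  # ['first document', 'second document']
```

A reasonable starting value for `max_in_flight` depends on model size and GPU memory, so it is best found by measuring throughput rather than guessed.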
Configuration for different GPU setups:
- Single A100 80 GB: "--gpu-memory-utilization 0.90 --tensor-parallel-size 1"
- Two A100 40 GB cards (70B model, 4-bit quantised; bf16 weights alone need ~140 GB): "--tensor-parallel-size 2"
- Four RTX 4090 24 GB cards (34B model): "--tensor-parallel-size 4"
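Before picking a tensor-parallel size, it helps to check whether the weights fit at all once they are sharded across the cards. The sketch below is a rough estimate under stated assumptions: the function name is hypothetical, it counts weights only (no KV cache or activations), and it treats 1B parameters at 8 bits as about 1 GB.

```python
def weights_fit(params_billion: float,
                bits_per_weight: float,
                num_gpus: int,
                vram_gb_per_gpu: float,
                gpu_memory_utilization: float = 0.90) -> bool:
    """Rough check: do the weights alone fit when sharded across GPUs?"""
    weights_gb = params_billion * bits_per_weight / 8   # 1B params at 8 bits ~ 1 GB
    usable_gb = num_gpus * vram_gb_per_gpu * gpu_memory_utilization
    return weights_gb <= usable_gb


print(weights_fit(34, 16, num_gpus=4, vram_gb_per_gpu=24))  # True  (68 GB of ~86 GB)
print(weights_fit(70, 16, num_gpus=2, vram_gb_per_gpu=40))  # False (140 GB of 72 GB)
print(weights_fit(70, 4, num_gpus=2, vram_gb_per_gpu=40))   # True  (35 GB of 72 GB)
```

A `True` result is necessary but not sufficient: the KV cache for your context length and concurrency still has to fit in what remains.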
Which GPU to choose and what does it cost?
[Figure: Which GPU for which model? VRAM required for inference at full precision or with 4-bit quantisation.]
Practical rule: budget roughly 2× the VRAM that the model weights occupy in your chosen precision; the headroom covers the KV cache, activations and concurrent requests.
What is quantisation (Q4, Q8, GGUF)?
Quantisation compresses model weights from full precision (16-bit, written bf16) to fewer bits — usually 8 (Q8) or 4 (Q4). It's the key technique that lets large models fit on consumer GPUs:
- bf16 (full precision) — an 8B model takes ~16 GB VRAM; best quality, needs an expensive GPU
- Q8 (8-bit) — ~8 GB for an 8B model; quality loss practically unmeasurable; the safe standard
- Q4 (4-bit) — ~5 GB for an 8B model; 1–3% quality loss on benchmarks; the most common self-hosted choice, because it fits a 2× larger model on the same GPU — and a larger Q4 model almost always beats a smaller bf16 one
- GGUF — the quantised-model file format used by llama.cpp and Ollama; download ready-made GGUF models from Hugging Face (look for the "-GGUF" suffix)
Practical rule: start with Q4. If you notice quality issues on your task — switch to Q8 or bf16 and compare on your golden dataset.
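To show what "fewer bits" means numerically, here is a deliberately simplified sketch of 4-bit quantisation: one scale for the whole list and round-to-nearest. This is an assumption-laden illustration, not the scheme GGUF files use (those apply per-block scales and more elaborate variants), and the function names are hypothetical.

```python
def quantise_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantisation to signed 4-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0   # int4 range used: -7..7
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes: list[int], scale: float) -> list[float]:
    """Map the 4-bit codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.70, -0.30, 0.10, 0.02]
codes, scale = quantise_int4(weights)
restored = dequantise(codes, scale)
print(codes)  # [7, -3, 1, 0]
for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

The smallest weight (0.02) collapses to zero, which is the kind of rounding error that shows up as a small benchmark loss at Q4 and mostly disappears at Q8.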
| GPU | VRAM | Cost (purchase, or per hour in cloud) | Max model (Q4 GGUF unless noted) | Throughput (~tok/s) |
|---|---|---|---|---|
| RTX 3060 | 12 GB | ~$300 | 7B | 25–35 |
| RTX 4090 | 24 GB | ~$1,600 | 13B or 7B bf16 | 60–90 |
| A10G 24 GB (cloud) | 24 GB | ~$1.5/h | 13B Q8 or 7B bf16 (13B bf16 weights alone need ~26 GB) | 80–120 |
| A100 40 GB (cloud) | 40 GB | ~$2.5/h | 34B Q4 or 13B bf16 | 150–200 |
| A100 80 GB (cloud) | 80 GB | ~$3.5/h | 70B Q4 or 34B bf16 | 120–180 |
| H100 80 GB (cloud) | 80 GB | ~$4–8/h | 70B Q4, fast (70B bf16 needs two cards) | 300–500 |
Cloud vs own GPU: For variable workloads (business hours) — cloud on-demand (RunPod, Lambda Labs, Vast.ai) is cheaper: you pay only for uptime. For constant 24/7 load — a dedicated machine pays for itself in 3–6 months. Calculator: monthly cloud cost = hourly rate × 720 h; if > $800 — your own server starts to win.
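The calculator in the paragraph above can be written out as follows. The 720-hour month and the $800 threshold come from the article; the function names and the business-hours figure (8 hours on 22 working days) are assumptions for illustration.

```python
HOURS_PER_MONTH = 720  # 24 h x 30 days, as in the rule above


def monthly_cloud_cost(hourly_rate: float,
                       hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Cloud GPU bill for one month at the given uptime."""
    return hourly_rate * hours_per_month


def own_server_wins(hourly_rate: float,
                    hours_per_month: float = HOURS_PER_MONTH,
                    threshold: float = 800.0) -> bool:
    """Apply the rule of thumb: above ~$800/month, owning hardware starts to win."""
    return monthly_cloud_cost(hourly_rate, hours_per_month) > threshold


# A10G at ~$1.5/h: running 24/7 versus business hours only (ASSUMED 8 h x 22 days).
print(monthly_cloud_cost(1.5))           # 1080.0
print(own_server_wins(1.5))              # True
print(monthly_cloud_cost(1.5, 8 * 22))   # 264.0
print(own_server_wins(1.5, 8 * 22))      # False
```

The same GPU lands on opposite sides of the threshold depending on uptime, which is why the workload pattern matters more than the hourly rate.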
Which models to choose for self-hosting?
| Model | Size | VRAM Q4 | Strengths | Weaknesses |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | ~5 GB | Best quality/VRAM, decent multilingual | Weaker reasoning than 70B |
| Llama 3.1 70B Instruct | 70B | ~40 GB | Close to GPT-4o-mini | Needs A100 |
| Mistral 7B v0.3 | 7B | ~4.5 GB | Fast, coding, EN | Weaker non-English |
| Phi-3 Medium 14B | 14B | ~8 GB | Great reasoning, small footprint | Short context in the 4k variant (a 128k variant exists) |
| Qwen2.5 7B Instruct | 7B | ~5 GB | Multilingual, coding | Distinctive style |
| Qwen2.5 72B Instruct | 72B | ~42 GB | Among the strongest open-weight models in 2025 | High GPU demands |
| Gemma 2 9B Instruct | 9B | ~5.5 GB | Google quality, safe | Smaller ecosystem |
Starting recommendation: begin with Llama 3.1 8B Instruct (Ollama: "ollama pull llama3.1:8b") — runs on an RTX 4090 or A10G, handles multiple languages, excellent quality-to-cost ratio. For 70B+ requirements — Qwen2.5 72B or Llama 3.1 70B on an A100 80 GB.
How to connect a self-hosted LLM to existing tools?
Both Ollama and vLLM expose an OpenAI-compatible REST API, so most tools that let you override the OpenAI base URL work with little or no modification:
- LangChain — ChatOpenAI(base_url="http://localhost:11434/v1", model="llama3.1:8b") — one line change
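- LlamaIndex: OpenAILike(api_base="http://localhost:8000/v1", model="meta-llama/Llama-3.1-8B-Instruct", is_chat_model=True) from the llama-index-llms-openai-like package; the stock OpenAI class only recognises OpenAI model names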
- n8n — in the OpenAI node, change "Base URL" to your local server address
- Open WebUI — a browser-based GUI for Ollama, works like ChatGPT, available at localhost:3000 after install
- Continue.dev — VS Code plugin with a local LLM instead of Copilot; set model and apiBase in config.json
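What all of these integrations rely on is the same request shape: a POST to `/v1/chat/completions` with a model name and a message list. The sketch below builds that request with the standard library only, without sending it, so it runs with no server. The function name and the placeholder API key are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "not-needed") -> urllib.request.Request:
    """Build (but do not send) a POST to {base_url}/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same function, two backends: only the base URL and model name change.
ollama_req = build_chat_request("http://localhost:11434/v1", "llama3.1:8b", "Hello!")
vllm_req = build_chat_request("http://localhost:8000/v1",
                              "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(ollama_req.full_url)  # http://localhost:11434/v1/chat/completions
# To send it, pass the request to urllib.request.urlopen() with a server running.
```

Keeping the base URL and model name in configuration rather than in code makes it straightforward to switch between the hosted API, Ollama and vLLM.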
Monitoring and security for self-hosted LLM
A self-hosted LLM is not just a GPU and a model — you need observability and access control, just like with an external API.
Observability:
- Langfuse self-hosted (Docker Compose) — traces, tokens, latency, compute costs; works the same as with an external API
- Prometheus + Grafana — GPU metrics (temperature, VRAM usage, throughput) via nvidia-smi or DCGM exporter
- Your own JSON log with request_id, model, prompt_tokens, completion_tokens, latency_ms, user_id for every call
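The per-call JSON log from the last bullet could look like the sketch below. The field names are the ones listed above; the function name and the sample values are hypothetical. In practice the token counts would typically be read from the `usage` object of the OpenAI-compatible response.

```python
import json
import time
import uuid


def make_log_line(model: str, prompt_tokens: int, completion_tokens: int,
                  started_at: float, finished_at: float, user_id: str) -> str:
    """Serialise one inference call as a single JSON line."""
    record = {
        "request_id": str(uuid.uuid4()),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": round((finished_at - started_at) * 1000),
        "user_id": user_id,
    }
    return json.dumps(record)


start = time.monotonic()
# ... call the inference server here ...
line = make_log_line("llama3.1:8b", 412, 96, start, time.monotonic(), "user-42")
print(line)
```

One JSON object per line keeps the log easy to ship to whichever aggregation tool you already use.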
Access control:
- vLLM supports API key auth via "--api-key SECRET" — enforce this from day one
- Reverse proxy (nginx or Caddy) with SSL + rate limiting in front of vLLM; never expose the vLLM port directly to the internet
- Log every call with IP and user_id for auditing; this supports GDPR accountability, but treat the logs themselves as personal data (retention limits, restricted access)
    # In the http {} context: 30 requests per minute per client IP
    limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=30r/m;

    server {
        listen 443 ssl;
        server_name llm.yourcompany.com;
        ssl_certificate /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;

        location /v1/ {
            proxy_pass http://127.0.0.1:8000;
            proxy_set_header X-Real-IP $remote_addr;
            limit_req zone=llm_limit burst=20 nodelay;
        }
    }
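To build intuition for what a rate of 30 requests per minute with a burst of 20 means, here is a token-bucket sketch with the same two numbers. It only approximates the behaviour: nginx implements `limit_req` as a leaky bucket and its exact burst accounting differs slightly. The class name is hypothetical, and the clock is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Per-client limiter: rate_per_min sustained, burst back-to-back."""

    def __init__(self, rate_per_min: float = 30, burst: int = 20):
        self.refill_per_sec = rate_per_min / 60
        self.capacity = burst
        self.tokens = float(burst)
        self.last_seen = None  # timestamp of the previous call

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time now (seconds) may pass."""
        if self.last_seen is not None:
            elapsed = now - self.last_seen
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_per_sec)
        self.last_seen = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_min=30, burst=20)
passed = sum(bucket.allow(now=0.0) for _ in range(25))
print(passed)                 # 20
print(bucket.allow(now=2.0))  # True
```

A client can spend its burst instantly, after which it is held to one request every two seconds until the bucket refills.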
Deployment checklist
1. Calculate TCO: API cost per month vs GPU cost + DevOps cost; if API < $400/mo, don't deploy your own model
2. Define compliance requirements: does the data really have to stay on-premises, or would a DPA with an API provider suffice?
3. Start with Ollama locally: verify the chosen model gives adequate quality on your data BEFORE buying a GPU
4. Choose GPU by the rule: VRAM ≥ 2× model size (Q4); for production with 5+ concurrent users, use an A10G or A100
5. For production, use vLLM instead of Ollama: it handles parallel requests efficiently thanks to PagedAttention
6. Expose the API through nginx with SSL, API key auth and rate limiting; never directly to the internet
7. Deploy GPU monitoring (Prometheus + Grafana) and LLM traces (Langfuse self-hosted)
8. Run a regression test on your golden dataset after every model change
9. Plan rollback: keep the previous model version; switching should take < 5 minutes
10. Review your GDPR policy: even self-hosted deployments require a processing register for personal data the LLM handles
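The golden-dataset regression test from the checklist can start as simply as the sketch below. Everything in it is illustrative: the function name, the phrase-matching criterion and the two sample cases are hypothetical, and the stub model stands in for a real call to your inference server.

```python
from typing import Callable


def regression_pass_rate(golden: list[dict],
                         generate: Callable[[str], str]) -> float:
    """Share of golden cases whose output contains every required phrase."""
    passed = 0
    for case in golden:
        output = generate(case["prompt"]).lower()
        if all(phrase.lower() in output for phrase in case["must_contain"]):
            passed += 1
    return passed / len(golden)


# HYPOTHETICAL golden cases and a stub model, for illustration only.
golden_set = [
    {"prompt": "What is the notice period?", "must_contain": ["30 days"]},
    {"prompt": "Who signs the contract?", "must_contain": ["CEO"]},
]


def stub_model(prompt: str) -> str:
    if "notice" in prompt:
        return "The notice period is 30 days."
    return "The CFO signs."


rate = regression_pass_rate(golden_set, stub_model)
print(rate)  # 0.5
```

Comparing this rate before and after a model or quantisation change gives a quick signal on whether to roll back; richer scoring (exact match, rubric grading) can replace the phrase check later.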
