Can I fine-tune GPT-4o?

Yes. OpenAI offers supervised fine-tuning for both GPT-4o and GPT-4o-mini through its fine-tuning API. List prices at launch were about $25/1M training tokens for GPT-4o and about $3/1M for GPT-4o-mini; check the current pricing page before budgeting. The model is deployed through the standard API right after training completes. Downsides: vendor lock-in, inference priced above the base model (about $0.30/1M input tokens for fine-tuned GPT-4o-mini, about $3.75/1M for fine-tuned GPT-4o), and no access to weights or architectural control. For most companies a better option is QLoRA on Llama 3.3 70B: comparable or better quality on narrow tasks, full control, and lower cost at scale above 10M tokens/month.

How many training examples do I actually need?

It depends on task complexity. Simple classification (2–5 classes): 100–300 examples is enough. Style and tone change: 300–1,000. Complex multi-step generation: 1,000–5,000. Distilling GPT-4 behaviour into a smaller model: 5,000–50,000. Quality matters more than quantity: 300 QA-verified examples beat 3,000 collected without verification. Start with 500, evaluate on the holdout set, and add data if you see overfitting or poor generalisation.

What is catastrophic forgetting and how do you prevent it?

Catastrophic forgetting is when a model loses general skills after fine-tuning on a narrow domain — e.g. after training on legal FAQs it starts struggling with general questions. Prevent it by: LoRA regularisation (lora_dropout 0.05–0.1), few epochs (3–5 max), low learning rate (1e-4 to 3e-4), adding 10–15% general examples to the training data. Always test the model on 50 out-of-domain examples after each training run.
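The "10–15% general examples" mix can be sketched as a small helper. This is a minimal illustration, not a prescribed recipe: the function name `mix_in_general`, the default share of 0.15 and the fixed seed are assumptions made for the example.

```python
import random


def mix_in_general(domain, general, general_share=0.15, seed=0):
    """Return a shuffled training set in which `general_share` of all rows
    are general-purpose examples and the rest are domain examples."""
    if not 0 <= general_share < 1:
        raise ValueError("general_share must be in [0, 1)")
    # n_general / (n_domain + n_general) = share
    # => n_general = share * n_domain / (1 - share)
    n_general = round(general_share * len(domain) / (1 - general_share))
    if n_general > len(general):
        raise ValueError("not enough general examples for the requested share")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed
```

With 850 domain examples and a 15% share, the helper adds 150 general examples for a training set of 1,000.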

What is the difference between fine-tuning and RAG and can you use both together?

Fine-tuning teaches the model how to respond (style, format, terminology). RAG provides what the model should say (current facts, documents). You can and should combine them: a fine-tuned model plus RAG retrieval gives consistent brand voice plus current factual knowledge in one architecture. Example: a bank fine-tunes a model for tone of voice and customer-service procedures, plus RAG with a live product database and exchange rates. This is the enterprise AI standard for 2025.

How long does QLoRA training take and what does it cost for a 7B model?

On an RTX 4090 (24 GB VRAM, ~$0.50/h on RunPod): 1,000 examples × 3 epochs = 2–3 h = $1–$1.50. 5,000 examples × 3 epochs = 8–10 h = $4–$5. Unsloth speeds this up 2–3×, so realistically: 5,000 examples in 4–5 h for $2–$2.50. On an A100 80 GB (~$2/h): the same data trains twice as fast. Full project cost from scratch (data + training + eval + Ollama or vLLM deployment): $50–$500 for a 7B model. For 70B QLoRA: $300–$1,500 with 10,000 examples.
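The GPU-rental arithmetic above is simply hours multiplied by the hourly rate. A minimal sketch, assuming the approximate $0.50/h rate quoted for the RTX 4090; the function name `training_cost` and the `speedup` parameter are illustrative, not part of any library.

```python
def training_cost(hours_low, hours_high, usd_per_hour, speedup=1.0):
    """Return the (low, high) GPU rental cost in USD for a training run.

    `speedup` divides the wall-clock time, e.g. 2.0 for a run that
    finishes twice as fast with an optimised trainer.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (hours_low / speedup * usd_per_hour,
            hours_high / speedup * usd_per_hour)


# 1,000 examples x 3 epochs on an RTX 4090 at ~$0.50/h
small_run = training_cost(2, 3, 0.50)
# 5,000 examples x 3 epochs with an assumed 2x speedup
large_run = training_cost(8, 10, 0.50, speedup=2.0)
```

Rental prices change often, so treat the rate as an input rather than a constant.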

How do I know when it is time to retrain the fine-tuned model?

Retrain when: (1) LLM-as-judge scores on your monitoring set drop by more than 5 percentage points from the post-training baseline, (2) a significant new product line, procedure or terminology is added that wasn't in the original training data, (3) the base model provider releases a meaningfully better new version. Practical cadence: re-evaluate every 3 months, retrain if needed. Keep training data versioned in Git so each retraining run is reproducible.
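Trigger (1) can be automated with a few lines. The sketch below assumes that "percentage points" refers to a pass rate, defined here as the share of responses the judge rates 4 or higher on the 1–5 scale; that definition, the names `pass_rate` and `needs_retraining`, and the default thresholds are assumptions for illustration.

```python
def pass_rate(scores, pass_mark=4):
    """Percentage of judge scores (1-5 scale) at or above `pass_mark`."""
    if not scores:
        raise ValueError("scores must not be empty")
    return 100.0 * sum(s >= pass_mark for s in scores) / len(scores)


def needs_retraining(baseline_scores, current_scores,
                     max_drop_pp=5.0, pass_mark=4):
    """True when the pass rate fell by more than `max_drop_pp`
    percentage points relative to the post-training baseline."""
    drop = (pass_rate(baseline_scores, pass_mark)
            - pass_rate(current_scores, pass_mark))
    return drop > max_drop_pp
```

For example, a baseline pass rate of 90% and a current pass rate of 84% is a 6-point drop and would flag the model for retraining.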

2026-06-09 · AI & Automation · 15 min

Fine-tuning an LLM on Company Data — When, How and What to Avoid

Paweł Wiszniewski

SEO & GEO Specialist · AI Engineer

Fine-tuning is worth doing when you have a style, tone or brand-voice problem — not a knowledge gap. If the model lacks your data (pricing, products, procedures), the solution is RAG, not fine-tuning. If the model knows the domain but expresses it in generic "assistant" language instead of yours — fine-tuning makes sense, provided you have at least 500 training pairs (input/output). In 80% of cases a better prompt with examples solves the problem faster and cheaper.

A practical guide to fine-tuning language models: when to choose FT over RAG or a better prompt, how to prepare JSONL data, QLoRA on an RTX 4090 with Unsloth, and how to measure whether training actually improved the model.

Your GPT-4o model answers correctly — but writes like a "generic assistant," not like a specialist from your firm. Terminology from the system prompt gets ignored. Product names are mangled. You have 50,000 archived support tickets that reflect exactly the style and expertise you want. Fine-tuning?

Maybe. The right question is different: do you have a style problem (the model knows but expresses it differently) — or a knowledge problem (the model doesn't know something it should)? The answer determines whether you need fine-tuning, RAG, a better prompt — or none of the above.

Criterion	Prompt Engineering	RAG	Fine-tuning
Goal	Better instruction	Add knowledge	Change model behaviour
One-time cost	~$0	$200–$2,000	$50–$5,000
Operating cost	Highest (long prompt)	Medium (retrieval)	Lowest (short prompt)
Time to deploy	Hours	Days	Days–weeks
Data needed	None	Documents	500–5,000 JSONL pairs
Data currency	Yes (in prompt)	Yes (retrieval)	No (training cutoff)
When it wins	Prototype, frequent changes	Large knowledge bases	Consistent style, terminology

When does fine-tuning actually make sense?

Fine-tuning is the right choice when at least one of the following conditions is true:

Consistent brand voice — the model should always write like your best specialist, not like a generic assistant; a good prompt can't guarantee this consistently across long conversations
Specialist domain terminology — proper nouns, acronyms and procedures the base model systematically gets wrong or fails to recognise
Strictly defined output format — report always in 5 sections, JSON always with fields X and Y; a "please write in format..." prompt fails when zero structure errors is a business requirement
Token cost reduction — a 3,000-token system prompt in every call costs money; fine-tuning bakes in the instructions, cutting prompt length by 60–80%
Privacy through distillation — copying commercial model behaviour to a self-hosted open-source model that never sends data outside the company infrastructure
Fast classification or routing — a fine-tuned 7B model for one task is 10× cheaper and faster than GPT-4o on the same task

When is fine-tuning the wrong choice?

You need current information — the model has a training cutoff; data from the last 6 months requires RAG, not fine-tuning
You want to add factual knowledge — "teach the model our product catalogue" is a RAG problem; fine-tuning teaches response format but facts from a small dataset are prone to hallucination
You have fewer than 300 examples for a style or generation task: with so little data the model overfits, memorising the training examples instead of generalising (simple classification can work with 100–300)
A better prompt hasn't been tried — start with an optimal system prompt with 5–10 few-shot examples; this solves 80% of style problems with zero GPU cost
Requirements change every week — updating a fine-tuned model means a new training cycle; a prompt can be changed in 5 minutes

Prompt Engineering vs RAG vs Fine-tuning

Method	When to choose	When not to choose	Cost	Time	Data
Prompt engineering (fast)	Quick prototype or MVP; changing requirements; standard response style; no access to data	Consistent brand tone is required; the model mishandles terminology	~$0	Hours	None
RAG (knowledge)	Up-to-date factual knowledge; large document bases; verifiable sources; data changes frequently	Consistent tone matters more than knowledge; latency < 500 ms is a requirement	$200–$2,000	Days	Documents
Fine-tuning (behaviour)	Consistent style and brand tone; specialised terminology; specific output format; token cost reduction	You need up-to-date data; you have < 300 examples	$50–$5,000	Days–weeks	500–5,000 pairs

Types of fine-tuning in 2025

The fine-tuning ecosystem has five main approaches:

Full fine-tuning: updates all model weights; best quality, but needs over 100 GB of VRAM across multiple GPUs even for a 7B model; reserved for AI labs with multi-million-dollar infrastructure
SFT (Supervised Fine-Tuning): learning from instruction→response pairs; the business standard; almost always used with LoRA or QLoRA rather than full FT
LoRA (Low-Rank Adaptation): adds a pair of small low-rank matrices (A and B) to attention layers instead of updating all weights; trains 0.1–1% of parameters at quality close to full FT with roughly 75% less VRAM (about 28 GB instead of 112 GB for a 7B model)
QLoRA (Quantized LoRA): LoRA plus 4-bit quantisation of the base model; enables training 7B and 13B models on an RTX 4090 (24 GB) and a 70B model on an A100 (80 GB); the 2025 industry standard for companies without a GPU data centre
DPO (Direct Preference Optimization): learns directly from preferred/rejected response pairs, without the separate reward model that RLHF requires; reduces unwanted model behaviours; used for alignment and safety

How to prepare training data

Data quality matters more than quantity — 500 perfect examples will beat 10,000 mediocre ones.

JSONL format:

Each line is one JSON example with fields: system (system instruction), input (question or task), output (expected response)
Alternatively: messages-based format with roles (compatible with the OpenAI fine-tuning API)
Encoding: UTF-8 without BOM, national characters without escaping
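The format rules above can be sketched with the standard library alone. This is an illustrative helper set, not a required tool: the function names are made up for the example, and the field names follow the `system` / `input` / `output` layout described in this section.

```python
import json

REQUIRED_FIELDS = {"system", "input", "output"}


def write_jsonl(records, path):
    """Write one JSON object per line, UTF-8 without BOM,
    keeping national characters unescaped (ensure_ascii=False)."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load a JSONL file and fail early if a row lacks a required field."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = REQUIRED_FIELDS - row.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            rows.append(row)
    return rows


def to_messages(record):
    """Convert a system/input/output record to the role-based
    messages layout used by chat-style fine-tuning APIs."""
    return {"messages": [
        {"role": "system", "content": record["system"]},
        {"role": "user", "content": record["input"]},
        {"role": "assistant", "content": record["output"]},
    ]}
```

Python's `utf-8` codec writes no byte-order mark, which is why it is used here rather than `utf-8-sig`.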

How many examples you need:

Simple classification (2–5 classes): 100–300 examples is enough
Style and tone change: 300–1,000 examples
Complex multi-step generation: 1,000–5,000 examples
Distilling GPT-4 behaviour into a smaller model: 5,000–50,000 examples

How to collect data:

Export and filter historical support responses rated by QA — the best source because it reflects real conversation patterns
Synthetic generation by GPT-4o with instruction "generate 50 input/output pairs for task X in style Y," verified by humans on a 20% sample
Manual annotation by domain experts for critical applications (law, medicine, finance)
Augmentation: paraphrases of the same questions improve model robustness to input variations

What to avoid in your data:

Duplicates and near-duplicates (for example, pairs above roughly 90% text similarity): they lead to overfitting on specific phrases
Contradictory examples (same input, different output) — confuse the model during gradient steps
No train/validation split (90/10) — without a holdout set you can't detect overfitting in time
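The three checks in this list can be sketched with the standard library. This is a minimal illustration under stated assumptions: similarity is measured with `difflib.SequenceMatcher` on the input text, the threshold is a caller-supplied parameter to tune on your own data, and all function names are hypothetical. The pairwise comparison is quadratic, so larger datasets would need an approximate method instead.

```python
import random
from difflib import SequenceMatcher


def _norm(text):
    return text.strip().lower()


def find_contradictions(rows):
    """Return normalised inputs that appear with more than one distinct output."""
    outputs = {}
    for row in rows:
        outputs.setdefault(_norm(row["input"]), set()).add(row["output"].strip())
    return [inp for inp, outs in outputs.items() if len(outs) > 1]


def drop_near_duplicates(rows, threshold):
    """Keep a row only if its input is less than `threshold` similar
    (SequenceMatcher ratio, 0..1) to every row kept so far."""
    kept = []
    for row in rows:
        text = _norm(row["input"])
        if all(SequenceMatcher(None, text, _norm(k["input"])).ratio() < threshold
               for k in kept):
            kept.append(row)
    return kept


def train_val_split(rows, val_share=0.1, seed=0):
    """Shuffle with a fixed seed and return (train, validation)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = max(1, round(len(rows) * val_share))
    return rows[n_val:], rows[:n_val]
```

Run `find_contradictions` before deduplication, because dropping near-duplicates would otherwise hide rows that share an input but disagree on the output.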

LoRA and QLoRA — fine-tuning on a single GPU

GPU memory requirements

Minimum VRAM for a typical training run (batch 2, sequence 2048 tokens)

Model and method	Minimum VRAM	Typical hardware	Tier
7B, full fine-tuning	112 GB	A100 ×2	Impractical for companies
13B, full fine-tuning	200 GB	A100 ×4	Impractical for companies
7B, LoRA (bf16)	28 GB	A100 40 GB	Expensive GPU (~$2–3/h)
13B, LoRA (bf16)	52 GB	A100 80 GB	Expensive GPU (~$2–3/h)
7B, QLoRA (4-bit)	8 GB	RTX 4090	Consumer GPU (~$0.50/h)
13B, QLoRA (4-bit)	12 GB	RTX 4090	Consumer GPU (~$0.50/h)
70B, QLoRA (4-bit)	48 GB	A100 80 GB	Accessible, requires datacenter

LoRA is a mathematical trick: instead of modifying weight matrix W (e.g. 4096×4096 ≈ 16.8M parameters), it adds two small matrices, A (4096×rank) and B (rank×4096), whose product has the same shape as W. Only A and B are trained. At rank=16 the trainable parameter count for that matrix drops from 16.8M to 131K, 128× fewer, at quality close to full fine-tuning on narrow tasks.
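The parameter arithmetic for a single weight matrix can be checked directly. A minimal sketch; the function name `lora_param_counts` is illustrative.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters for one weight matrix.

    Full fine-tuning updates the whole d_in x d_out matrix W.
    LoRA trains A (d_in x rank) and B (rank x d_out) instead.
    Returns (full, lora, reduction_factor).
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora, full / lora


full, lora, factor = lora_param_counts(4096, 4096, rank=16)
print(full, lora, factor)  # 16777216 131072 128.0
```

Doubling the rank doubles the adapter size, so the reduction factor halves; rank is the main knob trading capacity against memory.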

QLoRA adds 4-bit quantisation of the base model (NF4 format from the bitsandbytes library). Effect: a 7B model that normally takes 14 GB in bf16 takes 4 GB quantised, plus ~0.5 GB for LoRA adapters. An RTX 4090 with 24 GB VRAM comfortably handles training with batch 2 and 2,048-token sequences.
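The weight-memory figures follow from bytes per parameter. The sketch below counts weights only; it deliberately ignores quantisation constants, activations, gradients and optimizer state, which is why the practical 4-bit figure is closer to 4 GB than the raw result. The function name and the use of decimal gigabytes are assumptions for illustration.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to hold the model weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(7e9, 16)  # 7B model in bf16
nf4_gb = weight_memory_gb(7e9, 4)    # same model, 4-bit weights
print(bf16_gb, nf4_gb)  # 14.0 3.5
```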

finetune_qlora.py

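```python
# finetune_qlora.py: QLoRA with Unsloth (2-3x faster than standard HF Trainer)
# Import unsloth first so it can patch transformers and trl.
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

MAX_SEQ_LEN = 2048
RANK = 16

# Load the base model already quantised to 4-bit.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections; only these are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=RANK,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=RANK * 2,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing=True,
)

# train.jsonl: one JSON object per line with "system", "input" and "output".
dataset = load_dataset("json", data_files="train.jsonl", split="train")

TEMPLATE = "<|system|>\n{system}\n<|user|>\n{input}\n<|assistant|>\n{output}"

def format_chat(ex):
    # Append EOS so the model learns where a response ends.
    return {"text": TEMPLATE.format(**ex) + tokenizer.eos_token}

dataset = dataset.map(format_chat)

# Note: newer TRL releases move dataset_text_field / max_seq_length into
# SFTConfig and rename tokenizer= to processing_class=; adjust to your version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size 8
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),  # RTX 4090 and A100 support bf16
        logging_steps=10,
        save_steps=200,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        report_to="none",
    ),
)

trainer.train()

# Merge the adapters into the base weights and save a 16-bit model.
model.save_pretrained_merged("./my-model", tokenizer, save_method="merged_16bit")
```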

Unsloth speeds up training 2–3× versus the standard HuggingFace Trainer and reduces VRAM usage by a further 50–60%. The finished model can be loaded via Ollama (local deployment and testing), vLLM (high-throughput production API) or Hugging Face Inference Endpoints.

Where to train and what does it cost?

Platform	Base model	GPU	Cost 1,000 examples × 3 epochs	Deployment
OpenAI FT API	GPT-4o-mini	Managed	$6–$15	API immediately
HuggingFace + RunPod	Llama / Qwen	RTX 4090 (24 GB)	$1–$3	Inference Endpoints
Lambda Labs	Llama / Qwen	A100 (80 GB)	$8–$15	Self-deploy / vLLM
Google Vertex AI	Gemma / custom	Managed TPU	$20–$50	Vertex AI Prediction
Modal.com	Llama / Mistral	A10G on-demand	$3–$8	Serverless API
Unsloth + Colab Pro	Llama 3 up to 13B	T4 / A100	$0 (GPU limit)	Export GGUF to Ollama

How do you evaluate whether fine-tuning improved the model?

A subjective "it looks better" is not enough in a production environment. You need measurable metrics:

LLM-as-judge — GPT-4o rates 1–5 each response from the fine-tuned model vs the base model on a 100-example holdout set; cheap, scalable and highly correlated with human ratings
Task accuracy — for classification: accuracy and F1; for data extraction: precision and recall on key fields
Regression test — 50 examples outside the fine-tuning domain; verify the model hasn't lost general skills (catastrophic forgetting)
Latency and cost — does the shorter system prompt after fine-tuning actually reduce call time and cost by the planned 40–60%?
Hallucination rate — if the model outputs facts: what percentage of responses contain fabricated information before and after fine-tuning?
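Two of the metrics above reduce to a few lines of arithmetic once labels and judge scores are collected. A minimal sketch: the judge scores are assumed to be already available as lists of 1–5 ratings (obtaining them from a judge model is outside this example), and the function names are illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class of a classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def judge_win_rate(base_scores, tuned_scores):
    """Share of holdout examples where the fine-tuned model's judge score
    is strictly higher than the base model's score for the same example."""
    if len(base_scores) != len(tuned_scores) or not base_scores:
        raise ValueError("score lists must be non-empty and the same length")
    wins = sum(t > b for b, t in zip(base_scores, tuned_scores))
    return wins / len(base_scores)
```

Running the same two functions on the out-of-domain regression set gives a before/after comparison for catastrophic forgetting as well.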

Most common fine-tuning mistakes

Fine-tuning instead of a good prompt — 80% of style and format problems are solved by an optimal system prompt with 5 few-shot examples; check this before spending time and money on training
Too little data = overfitting — the model perfectly reproduces training data but fails on new inputs; symptom: validation loss rises while training loss falls
Too many epochs — 3–5 epochs is usually optimal for SFT; 10+ epochs leads to overfitting and degradation of general model skills
Training on unverified data — the model will learn errors and inaccuracies from training data; every example needs human or LLM-judge verification
No regression tests — fine-tuning on a narrow domain can distort responses outside that domain; always test general questions after training
Vendor lock-in of training data — training data is your strategic asset; store it in vendor-agnostic JSONL and train parallel open-source models as a backup
No post-deployment monitoring — a fine-tuned model can degrade as the input distribution shifts; monitor quality metrics regularly and plan retraining cycles

Checklist before fine-tuning

1.Check whether a better prompt or few-shot examples solves the problem — invest 2–4 hours before going further
2.Check whether RAG is a better solution (if it is a factual knowledge problem, not a style problem — use RAG instead)
3.Collect at minimum 500 input/output pairs with QA on each example in JSONL format
4.Split the data: 90% training, 10% validation, plus 50–100 holdout examples outside the split
5.Choose a platform based on budget and data privacy requirements (GDPR/HIPAA especially)
6.Start with a 7B model — fine-tuning 7B is 3× cheaper than 13B at similar quality on narrow tasks
7.After training, run an A/B test on the holdout set with LLM-as-judge: fine-tuned vs base model vs model with a better prompt
8.Plan a retraining cycle every 3–6 months as the domain and input patterns evolve

---

I help companies decide whether fine-tuning is the right step — and if so, I run the entire process: from data auditing and preparation through QLoRA training to deployment and quality monitoring. Get in touch — I start with a free 30-minute analysis of your use case.
