Fine-tuning an LLM on Company Data — When, How and What to Avoid
A practical guide to fine-tuning language models: when to choose FT over RAG or a better prompt, how to prepare JSONL data, QLoRA on an RTX 4090 with Unsloth, and how to measure whether training actually improved the model.
Fine-tuning is worth doing when you have a style, tone or brand-voice problem — not a knowledge gap. If the model lacks your data (pricing, products, procedures), the solution is RAG, not fine-tuning. If the model knows the domain but expresses it in generic "assistant" language instead of yours — fine-tuning makes sense, provided you have at least 500 training pairs (input/output). In 80% of cases a better prompt with examples solves the problem faster and cheaper.
Your GPT-4o model answers correctly — but writes like a "generic assistant," not like a specialist from your firm. Terminology from the system prompt gets ignored. Product names are mangled. You have 50,000 archived support tickets that reflect exactly the style and expertise you want. Fine-tuning?
Maybe. The right question is different: do you have a style problem (the model knows but expresses it differently) — or a knowledge problem (the model doesn't know something it should)? The answer determines whether you need fine-tuning, RAG, a better prompt — or none of the above.
| Criterion | Prompt Engineering | RAG | Fine-tuning |
|---|---|---|---|
| Goal | Better instruction | Add knowledge | Change model behaviour |
| One-time cost | ~$0 | $200–$2,000 | $50–$5,000 |
| Operating cost | Highest (long prompt) | Medium (retrieval) | Lowest (short prompt) |
| Time to deploy | Hours | Days | Days–weeks |
| Data needed | None | Documents | 500–5,000 JSONL pairs |
| Data currency | Yes (in prompt) | Yes (retrieval) | No (training cutoff) |
| When it wins | Prototype, frequent changes | Large knowledge bases | Consistent style, terminology |
When does fine-tuning actually make sense?
Fine-tuning is the right choice when at least one of the following conditions is true:
- Consistent brand voice — the model should always write like your best specialist, not like a generic assistant; a good prompt can't guarantee this consistently across long conversations
- Specialist domain terminology — proper nouns, acronyms and procedures the base model systematically gets wrong or fails to recognise
- Strictly defined output format — report always in 5 sections, JSON always with fields X and Y; a "please write in format..." prompt fails when zero structure errors is a business requirement
- Token cost reduction — a 3,000-token system prompt in every call costs money; fine-tuning bakes in the instructions, cutting prompt length by 60–80%
- Privacy through distillation — copying commercial model behaviour to a self-hosted open-source model that never sends data outside the company infrastructure
- Fast classification or routing — a fine-tuned 7B model for one task is 10× cheaper and faster than GPT-4o on the same task
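The token-cost argument in the list above can be checked with simple arithmetic. The sketch below is only an illustration: the call volume and the price per million input tokens are hypothetical placeholders, and `monthly_prompt_cost` is a helper name introduced here, not part of any library.

```python
def monthly_prompt_cost(prompt_tokens: int, calls_per_month: int,
                        usd_per_million_tokens: float) -> float:
    """Cost of sending the same system prompt with every call."""
    return prompt_tokens * calls_per_month * usd_per_million_tokens / 1_000_000


# Hypothetical workload and price; replace with your own numbers.
CALLS_PER_MONTH = 100_000
USD_PER_MILLION_INPUT_TOKENS = 2.50

# 3,000-token system prompt sent with every call
before = monthly_prompt_cost(3_000, CALLS_PER_MONTH, USD_PER_MILLION_INPUT_TOKENS)
# A 70% cut (middle of the 60-80% range) leaves a 900-token prompt
after = monthly_prompt_cost(900, CALLS_PER_MONTH, USD_PER_MILLION_INPUT_TOKENS)

print(f"before: ${before:,.2f}  after: ${after:,.2f}  saved: ${before - after:,.2f}")
```

The calculation ignores any per-token price difference between a base model and its fine-tuned variant, so treat the result as an upper bound on the saving.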
When is fine-tuning the wrong choice?
- You need current information — the model has a training cutoff; data from the last 6 months requires RAG, not fine-tuning
- You want to add factual knowledge — "teach the model our product catalogue" is a RAG problem; fine-tuning teaches response format but facts from a small dataset are prone to hallucination
- You have fewer than 300 examples for a style or generation task: too little data leads to overfitting; the model memorises the training examples and can lose general skills (catastrophic forgetting)
- A better prompt hasn't been tried — start with an optimal system prompt with 5–10 few-shot examples; this solves 80% of style problems with zero GPU cost
- Requirements change every week — updating a fine-tuned model means a new training cycle; a prompt can be changed in 5 minutes
When to choose which method: Prompt Engineering vs RAG vs Fine-tuning
| Criterion | Prompt Engineering | RAG | Fine-tuning |
|---|---|---|---|
| Choose when | Rapid prototype or MVP; changing requirements; standard response style; no access to data | Current factual knowledge; large document bases; verifiable sources; data changes often | Consistent brand style and tone; specialist terminology; specific output format; token cost reduction |
| Avoid when | Consistent brand tone is a requirement; the model mishandles terminology | Consistent tone matters more than knowledge; latency < 500 ms is a requirement | You need current data; you have < 300 examples |
Types of fine-tuning in 2025
The fine-tuning ecosystem has five main approaches:
- Full fine-tuning — updates all model weights; best quality, but requires tens of GB of VRAM across multiple GPUs; reserved for AI labs with multi-million-dollar infrastructure
- SFT (Supervised Fine-Tuning) — learning from instruction→response pairs; the business standard; almost always used with LoRA or QLoRA rather than full FT
- LoRA (Low-Rank Adaptation) — adds small A×B matrices to attention layers instead of updating all weights; trains 0.1–1% of parameters at quality close to full FT with 90% less VRAM
- QLoRA (Quantized LoRA) — LoRA plus 4-bit quantisation of the base model; enables training a 7B model on an RTX 4090 (24 GB) and 13B on an A100 (40 GB); the 2025 industry standard for companies without a GPU data centre
- DPO (Direct Preference Optimization): learns directly from preferred/rejected response pairs, without the separate reward model that RLHF requires; reduces unwanted model behaviours; used for alignment and safety
How to prepare training data
Data quality matters more than quantity — 500 perfect examples will beat 10,000 mediocre ones.
JSONL format:
- Each line is one JSON example with fields: system (system instruction), input (question or task), output (expected response)
- Alternatively: messages-based format with roles (compatible with the OpenAI fine-tuning API)
- Encoding: UTF-8 without BOM, national characters without escaping
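A minimal sketch of the format rules above, assuming the three field names from the list (`system`, `input`, `output`). The helper names and the sample record are hypothetical; the role-based layout mirrors the messages-based alternative mentioned above.

```python
import json

REQUIRED_FIELDS = {"system", "input", "output"}


def dump_jsonl(records):
    """Serialise records as JSONL: one JSON object per line."""
    # ensure_ascii=False keeps national characters readable instead of \uXXXX escapes
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def load_jsonl(text):
    """Parse JSONL text and check that every record has the required fields."""
    records = []
    # Split on "\n" only: str.splitlines() would also break on characters
    # such as U+2028 that may legally appear inside a JSON string.
    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def to_messages(record):
    """Convert a system/input/output record to the role-based messages layout."""
    return {"messages": [
        {"role": "system", "content": record["system"]},
        {"role": "user", "content": record["input"]},
        {"role": "assistant", "content": record["output"]},
    ]}


# Hypothetical example record with Polish characters
examples = [{
    "system": "You are a support specialist at Acme.",
    "input": "Gdzie znajdę fakturę?",
    "output": "Fakturę znajdziesz w zakładce Płatności.",
}]

jsonl_text = dump_jsonl(examples)
# To write a file without a BOM, open it with encoding="utf-8" (not "utf-8-sig")
print(jsonl_text, end="")
print(json.dumps(to_messages(examples[0]), ensure_ascii=False))
```

Running the loader over the whole file before training catches missing fields early, when they are still cheap to fix.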
How many examples you need:
- Simple classification (2–5 classes): 100–300 examples is enough
- Style and tone change: 300–1,000 examples
- Complex multi-step generation: 1,000–5,000 examples
- Distilling GPT-4 behaviour into a smaller model: 5,000–50,000 examples
How to collect data:
- Export and filter historical support responses rated by QA — the best source because it reflects real conversation patterns
- Synthetic generation by GPT-4o with instruction "generate 50 input/output pairs for task X in style Y," verified by humans on a 20% sample
- Manual annotation by domain experts for critical applications (law, medicine, finance)
- Augmentation: paraphrases of the same questions improve model robustness to input variations
What to avoid in your data:
- Duplicates and near-duplicates (similarity > 90%): lead to overfitting on specific phrases
- Contradictory examples (same input, different output) — confuse the model during gradient steps
- No train/validation split (90/10) — without a holdout set you can't detect overfitting in time
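The three checks above can be sketched with the standard library alone. This is an illustration under assumptions, not a production pipeline: the helper names are hypothetical, similarity is measured with `difflib.SequenceMatcher` on the input text only, the threshold is a parameter you would tune on your own data, and the pairwise comparison is quadratic, so large datasets need a faster method.

```python
import random
from difflib import SequenceMatcher


def find_contradictions(examples):
    """Return normalised inputs that appear with more than one distinct output."""
    outputs = {}
    for ex in examples:
        outputs.setdefault(ex["input"].strip().lower(), set()).add(ex["output"].strip())
    return sorted(key for key, outs in outputs.items() if len(outs) > 1)


def drop_near_duplicates(examples, threshold):
    """Keep an example only if its input is not too similar to one already kept."""
    kept = []
    for ex in examples:
        text = ex["input"].strip().lower()
        is_new = all(
            SequenceMatcher(None, text, k["input"].strip().lower()).ratio() <= threshold
            for k in kept
        )
        if is_new:
            kept.append(ex)
    return kept


def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out val_fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical toy data
data = [
    {"input": "How do I reset my password?", "output": "A"},
    {"input": "How do I reset my password", "output": "A"},
    {"input": "Where is my invoice?", "output": "B"},
    {"input": "Where is my invoice?", "output": "C"},
]

# Look for contradictions first: deduplication would otherwise hide them
print(find_contradictions(data))
# 0.9 is an assumed threshold for this toy data
print(len(drop_near_duplicates(data, threshold=0.9)))
```

Contradictory inputs are best resolved by a human before deduplication, since dropping one of the two silently picks a winner.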
LoRA and QLoRA — fine-tuning on a single GPU
Figure: GPU memory requirements by training method. Minimum VRAM for typical training (batch 2, 2,048-token sequences).
LoRA is a mathematical trick: instead of modifying weight matrix W (e.g. 4096×4096 ≈ 16.8M parameters), it adds two small matrices: A (4096×rank) and B (rank×4096). Only A and B are trained. At rank=16 the trainable parameter count for that matrix drops from about 16.8M to 131K (2 × 4096 × 16 = 131,072), which is 128× fewer, at quality close to full fine-tuning on narrow tasks.
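The parameter arithmetic for a single weight matrix can be verified in a few lines. The function name is introduced here for illustration; the shapes follow the paragraph above.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (parameters in the full matrix W, parameters in the LoRA pair A and B)."""
    full = d_in * d_out                # W: d_in x d_out
    lora = d_in * rank + rank * d_out  # A: d_in x rank, B: rank x d_out
    return full, lora


full, lora = lora_param_counts(4096, 4096, rank=16)
print(full, lora, full // lora)  # 16777216 131072 128
```

Doubling the rank doubles the adapter size, so the reduction factor for this matrix falls to 64× at rank 32.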
QLoRA adds 4-bit quantisation of the base model (NF4 format from the bitsandbytes library). Effect: a 7B model that normally takes 14 GB in bf16 takes 4 GB quantised, plus ~0.5 GB for LoRA adapters. An RTX 4090 with 24 GB VRAM comfortably handles training with batch 2 and 2,048-token sequences.
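A rough check of the memory figures, counting weights only. The helper is hypothetical, and the estimate deliberately leaves out activations, adapter gradients, optimizer state and quantisation metadata, which is presumably why 3.5 GB is rounded up to about 4 GB in practice.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(7e9, 16)  # 16 bits per weight in bf16
nf4_gb = weight_memory_gb(7e9, 4)    # 4 bits per weight after quantisation

print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")  # bf16: 14.0 GB, 4-bit: 3.5 GB
```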
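```python
# finetune_qlora.py: QLoRA with Unsloth (2-3x faster than the standard HF Trainer)
from unsloth import FastLanguageModel, is_bfloat16_supported  # import unsloth before trl/transformers
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

MAX_SEQ_LEN = 2048
RANK = 16

# Load the base model with 4-bit quantised weights
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections; only these are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=RANK,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=RANK * 2,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing=True,
)

# Each line of train.jsonl needs the fields: system, input, output
dataset = load_dataset("json", data_files="train.jsonl", split="train")

TEMPLATE = "<|system|>\n{system}\n<|user|>\n{input}\n<|assistant|>\n{output}"


def format_chat(ex):
    # Append EOS so the model learns where a response ends
    return {"text": TEMPLATE.format(**ex) + tokenizer.eos_token}


dataset = dataset.map(format_chat)

# These keyword arguments follow the older TRL SFTTrainer signature;
# newer TRL releases may expect dataset_text_field / max_seq_length in SFTConfig.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
        learning_rate=2e-4,
        # Use bf16 where the GPU supports it (e.g. RTX 4090), otherwise fp16
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        save_steps=200,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        report_to="none",
    ),
)

trainer.train()

# Merge the adapters into the base weights and save a 16-bit model
model.save_pretrained_merged("./my-model", tokenizer, save_method="merged_16bit")
```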
Unsloth speeds up training 2–3× versus the standard HuggingFace Trainer and reduces VRAM usage by a further 50–60%. The finished model can be loaded via Ollama (local deployment and testing), vLLM (high-throughput production API) or Hugging Face Inference Endpoints.
Where to train and what does it cost?
| Platform | Base model | GPU | Cost 1,000 examples × 3 epochs | Deployment |
|---|---|---|---|---|
| OpenAI FT API | GPT-4o-mini | Managed | $6–$15 | API immediately |
| HuggingFace + RunPod | Llama / Qwen | RTX 4090 (24 GB) | $1–$3 | Inference Endpoints |
| Lambda Labs | Llama / Qwen | A100 (80 GB) | $8–$15 | Self-deploy / vLLM |
| Google Vertex AI | Gemma / custom | Managed TPU | $20–$50 | Vertex AI Prediction |
| Modal.com | Llama / Mistral | A10G on-demand | $3–$8 | Serverless API |
| Unsloth + Colab Pro | Llama 3 up to 13B | T4 / A100 | $0 (GPU limit) | Export GGUF to Ollama |
How do you evaluate whether fine-tuning improved the model?
A subjective "it looks better" is not enough in a production environment. You need measurable metrics:
- LLM-as-judge — GPT-4o rates 1–5 each response from the fine-tuned model vs the base model on a 100-example holdout set; cheap, scalable and highly correlated with human ratings
- Task accuracy — for classification: accuracy and F1; for data extraction: precision and recall on key fields
- Regression test — 50 examples outside the fine-tuning domain; verify the model hasn't lost general skills (catastrophic forgetting)
- Latency and cost — does the shorter system prompt after fine-tuning actually reduce call time and cost by the planned 40–60%?
- Hallucination rate — if the model outputs facts: what percentage of responses contain fabricated information before and after fine-tuning?
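Two of the metrics above, sketched with the standard library. The function names are hypothetical, and `judge` is an assumed callable that wraps whatever LLM you use as a rater and returns a 1–5 score; the stub below only stands in for it so the example runs offline.

```python
from statistics import mean


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def compare_with_judge(prompts, base_answers, tuned_answers, judge):
    """Score both models on the same holdout prompts and summarise the difference."""
    base = [judge(p, a) for p, a in zip(prompts, base_answers)]
    tuned = [judge(p, a) for p, a in zip(prompts, tuned_answers)]
    wins = sum(t > b for t, b in zip(tuned, base))
    return {
        "base_mean": mean(base),
        "tuned_mean": mean(tuned),
        "win_rate": wins / len(prompts),
    }


def stub_judge(prompt, answer):
    # Stand-in for a real LLM judge: rewards answers that mention the brand name
    return 5 if "Acme" in answer else 2


# Hypothetical holdout data
prompts = ["Q1", "Q2"]
base_answers = ["Generic reply.", "Generic reply."]
tuned_answers = ["Acme-style reply.", "Generic reply."]

print(precision_recall_f1(["refund", "refund", "other", "other"],
                          ["refund", "other", "other", "refund"], "refund"))
print(compare_with_judge(prompts, base_answers, tuned_answers, stub_judge))
```

Scoring both models on identical prompts keeps the comparison paired, so the win rate is meaningful even when absolute judge scores drift between runs.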
Most common fine-tuning mistakes
- Fine-tuning instead of a good prompt — 80% of style and format problems are solved by an optimal system prompt with 5 few-shot examples; check this before spending time and money on training
- Too little data = overfitting — the model perfectly reproduces training data but fails on new inputs; symptom: validation loss rises while training loss falls
- Too many epochs — 3–5 epochs is usually optimal for SFT; 10+ epochs leads to overfitting and degradation of general model skills
- Training on unverified data — the model will learn errors and inaccuracies from training data; every example needs human or LLM-judge verification
- No regression tests — fine-tuning on a narrow domain can distort responses outside that domain; always test general questions after training
- Vendor lock-in of training data — training data is your strategic asset; store it in vendor-agnostic JSONL and train parallel open-source models as a backup
- No post-deployment monitoring — a fine-tuned model can degrade as the input distribution shifts; monitor quality metrics regularly and plan retraining cycles
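The overfitting symptom named in the list (validation loss rising while training loss falls) can be detected from logged losses. A minimal sketch, assuming one loss value per evaluation step; the function name and the `patience` default are hypothetical choices.

```python
def first_overfit_step(train_loss, val_loss, patience=2):
    """Return the index where validation loss began rising for `patience`
    consecutive evaluations while training loss kept falling, else None."""
    streak = 0
    for i in range(1, min(len(train_loss), len(val_loss))):
        diverging = val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]
        streak = streak + 1 if diverging else 0
        if streak >= patience:
            return i - patience + 1
    return None


# Hypothetical loss curves, one value per evaluation
train = [2.0, 1.5, 1.1, 0.8, 0.6, 0.4]
val_overfit = [2.1, 1.7, 1.5, 1.6, 1.8, 2.0]
val_healthy = [2.1, 1.7, 1.5, 1.4, 1.35, 1.3]

print(first_overfit_step(train, val_overfit))  # 3
print(first_overfit_step(train, val_healthy))  # None
```

The checkpoint just before the returned index is the one to keep. If you train with the Hugging Face Trainer, its `EarlyStoppingCallback` offers a built-in alternative to a hand-rolled check like this one.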
Checklist before fine-tuning
1. Check whether a better prompt or few-shot examples solves the problem; invest 2–4 hours before going further
2. Check whether RAG is a better solution (if it is a factual knowledge problem, not a style problem, use RAG instead)
3. Collect at least 500 input/output pairs in JSONL format, with QA on each example
4. Split the data: 90% training, 10% validation, plus 50–100 holdout examples outside the split
5. Choose a platform based on budget and data privacy requirements (GDPR/HIPAA especially)
6. Start with a 7B model; fine-tuning 7B is 3× cheaper than 13B at similar quality on narrow tasks
7. After training, run an A/B test on the holdout set with LLM-as-judge: fine-tuned vs base model vs model with a better prompt
8. Plan a retraining cycle every 3–6 months as the domain and input patterns evolve
---
I help companies decide whether fine-tuning is the right step — and if so, I run the entire process: from data auditing and preparation through QLoRA training to deployment and quality monitoring. Get in touch — I start with a free 30-minute analysis of your use case.
