AI & Automation 15 min

Fine-tuning an LLM on Company Data — When, How and What to Avoid

A practical guide to fine-tuning language models: when to choose FT over RAG or a better prompt, how to prepare JSONL data, QLoRA on an RTX 4090 with Unsloth, and how to measure whether training actually improved the model.

Fine-tuning is worth doing when you have a style, tone or brand-voice problem — not a knowledge gap. If the model lacks your data (pricing, products, procedures), the solution is RAG, not fine-tuning. If the model knows the domain but expresses it in generic "assistant" language instead of yours — fine-tuning makes sense, provided you have at least 500 training pairs (input/output). In 80% of cases a better prompt with examples solves the problem faster and cheaper.

Your GPT-4o model answers correctly — but writes like a "generic assistant," not like a specialist from your firm. Terminology from the system prompt gets ignored. Product names are mangled. You have 50,000 archived support tickets that reflect exactly the style and expertise you want. Fine-tuning?

Maybe. The right question is different: do you have a style problem (the model knows but expresses it differently) — or a knowledge problem (the model doesn't know something it should)? The answer determines whether you need fine-tuning, RAG, a better prompt — or none of the above.

| Criterion | Prompt Engineering | RAG | Fine-tuning |
|---|---|---|---|
| Goal | Better instruction | Add knowledge | Change model behaviour |
| One-time cost | ~$0 | $200–$2,000 | $50–$5,000 |
| Operating cost | Highest (long prompt) | Medium (retrieval) | Lowest (short prompt) |
| Time to deploy | Hours | Days | Days–weeks |
| Data needed | None | Documents | 500–5,000 JSONL pairs |
| Data stays current | Yes (in prompt) | Yes (retrieval) | No (training cutoff) |
| When it wins | Prototype, frequent changes | Large knowledge bases | Consistent style, terminology |

When does fine-tuning actually make sense?

Fine-tuning is the right choice when at least one of the following conditions is true:

  • Consistent brand voice — the model should always write like your best specialist, not like a generic assistant; a good prompt can't guarantee this consistently across long conversations
  • Specialist domain terminology — proper nouns, acronyms and procedures the base model systematically gets wrong or fails to recognise
  • Strictly defined output format — report always in 5 sections, JSON always with fields X and Y; a "please write in format..." prompt fails when zero structure errors is a business requirement
  • Token cost reduction — a 3,000-token system prompt in every call costs money; fine-tuning bakes in the instructions, cutting prompt length by 60–80%
  • Privacy through distillation — copying commercial model behaviour to a self-hosted open-source model that never sends data outside the company infrastructure
  • Fast classification or routing — a fine-tuned 7B model for one task is 10× cheaper and faster than GPT-4o on the same task
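
The token-cost argument above can be checked with simple arithmetic. The sketch below is illustrative only: the price per million input tokens and the monthly call volume are hypothetical placeholders, and the 70% reduction is just the midpoint of the 60–80% range mentioned in the list.

```python
def monthly_prompt_cost(system_tokens, calls_per_month, usd_per_million_tokens):
    """Cost of sending the system prompt alone with every call, in USD per month."""
    return system_tokens * calls_per_month * usd_per_million_tokens / 1_000_000


# Hypothetical assumptions: replace with your provider's real price and your traffic.
PRICE_PER_MILLION = 2.50   # USD per 1M input tokens (placeholder)
CALLS_PER_MONTH = 100_000  # placeholder

before = monthly_prompt_cost(3_000, CALLS_PER_MONTH, PRICE_PER_MILLION)
after = monthly_prompt_cost(900, CALLS_PER_MONTH, PRICE_PER_MILLION)  # prompt 70% shorter
savings = before - after

print(f"before: ${before:.2f}, after: ${after:.2f}, saved: ${savings:.2f} per month")
```

Whether the saving justifies the training cost depends mostly on call volume: at low traffic the one-time fine-tuning cost can exceed years of prompt overhead.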

When is fine-tuning the wrong choice?

  • You need current information — the model has a training cutoff; data from the last 6 months requires RAG, not fine-tuning
  • You want to add factual knowledge — "teach the model our product catalogue" is a RAG problem; fine-tuning teaches response format but facts from a small dataset are prone to hallucination
  • You have fewer than 300 examples: too little data leads to overfitting; the model memorises the training phrases instead of generalising and can lose general skills (catastrophic forgetting)
  • A better prompt hasn't been tried — start with an optimal system prompt with 5–10 few-shot examples; this solves 80% of style problems with zero GPU cost
  • Requirements change every week — updating a fine-tuned model means a new training cycle; a prompt can be changed in 5 minutes

Prompt Engineering vs RAG vs Fine-tuning: when to choose which method?

Prompt Engineering (fast)

Cost: ~$0. Time: hours. Data: none.

When to choose:

  • Quick prototype or MVP
  • Changing requirements
  • Standard response style
  • No access to data

When NOT:

  • A consistent brand tone is a requirement
  • The model mishandles terminology

RAG (knowledge)

Cost: $200–$2,000. Time: days. Data: documents.

When to choose:

  • Current factual knowledge
  • Large document bases
  • Verifiable sources
  • Data changes often

When NOT:

  • A consistent tone matters more than knowledge
  • Latency < 500 ms is a requirement

Fine-tuning (behaviour)

Cost: $50–$5,000. Time: days to weeks. Data: 500–5,000 pairs.

When to choose:

  • Consistent brand style and tone
  • Specialist terminology
  • Specific output format
  • Token cost reduction

When NOT:

  • You need current data
  • You have < 300 examples

Types of fine-tuning in 2025

The fine-tuning ecosystem has five main approaches:

  • Full fine-tuning: updates all model weights; best quality, but needs 100+ GB of VRAM across multiple GPUs even for a 7B model; rarely practical outside teams with dedicated multi-GPU infrastructure
  • SFT (Supervised Fine-Tuning): learning from instruction→response pairs; the business standard; almost always used with LoRA or QLoRA rather than full FT
  • LoRA (Low-Rank Adaptation): adds a pair of small matrices A and B to attention layers instead of updating all weights; trains 0.1–1% of parameters at quality close to full FT with a fraction of the VRAM
  • QLoRA (Quantized LoRA): LoRA plus 4-bit quantisation of the base model; enables training a 7B or 13B model on an RTX 4090 (24 GB) and a 70B model on an A100 (80 GB); the 2025 industry standard for companies without a GPU data centre
  • DPO (Direct Preference Optimization): learns directly from good/bad response pairs without a separate RLHF reward model; reduces unwanted model behaviours; used for alignment and safety
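
Unlike SFT, DPO needs two responses per prompt: one preferred and one rejected. A minimal sketch of such a record follows, assuming the prompt/chosen/rejected field names used by common DPO tooling such as TRL's DPOTrainer; the helper name and the example texts are hypothetical.

```python
import json


def make_preference_record(prompt, chosen, rejected):
    """Build one preference example in the prompt/chosen/rejected layout."""
    if chosen.strip() == rejected.strip():
        # A pair with identical responses carries no preference signal.
        raise ValueError("chosen and rejected responses must differ")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Hypothetical example: the preferred answer uses the company's direct style.
record = make_preference_record(
    prompt="How do I reset my password?",
    chosen="Open Settings > Security and select Reset password.",
    rejected="As an AI assistant, I would suggest that you might try looking in the settings.",
)

# One JSON object per line, the same JSONL convention as SFT data.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

In practice the rejected responses are often the base model's own outputs, which makes the pairs cheap to collect from existing logs.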

How to prepare training data

Data quality matters more than quantity — 500 perfect examples will beat 10,000 mediocre ones.

JSONL format:

  • Each line is one JSON example with fields: system (system instruction), input (question or task), output (expected response)
  • Alternatively: messages-based format with roles (compatible with the OpenAI fine-tuning API)
  • Encoding: UTF-8 without BOM, national characters without escaping
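
As an illustration of the points above, the sketch below writes one record in the flat system/input/output layout and converts it to the messages-based layout. The example record, file names and helper names are hypothetical; only the field names come from the list above.

```python
import json
import os
import tempfile

# Hypothetical training example with national characters.
examples = [
    {
        "system": "You are a support specialist. Answer briefly.",
        "input": "Jak zresetować hasło?",
        "output": "Otwórz Ustawienia, a następnie Bezpieczeństwo.",
    },
]


def to_messages(example):
    """Convert a flat system/input/output record to the role-based chat layout."""
    return {
        "messages": [
            {"role": "system", "content": example["system"]},
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]},
        ]
    }


def write_jsonl(path, records):
    """Write one JSON object per line.

    encoding="utf-8" writes no BOM; ensure_ascii=False keeps characters such
    as "ł" readable instead of escaping them as \\uXXXX sequences.
    """
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


out_dir = tempfile.mkdtemp()
flat_path = os.path.join(out_dir, "train.jsonl")
messages_path = os.path.join(out_dir, "train_messages.jsonl")

write_jsonl(flat_path, examples)
write_jsonl(messages_path, [to_messages(e) for e in examples])
```

Keeping the flat file as the source of truth and generating the messages file from it avoids maintaining two copies of the same data.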

How many examples you need:

  • Simple classification (2–5 classes): 100–300 examples is enough
  • Style and tone change: 300–1,000 examples
  • Complex multi-step generation: 1,000–5,000 examples
  • Distilling GPT-4 behaviour into a smaller model: 5,000–50,000 examples

How to collect data:

  • Export and filter historical support responses rated by QA — the best source because it reflects real conversation patterns
  • Synthetic generation by GPT-4o with instruction "generate 50 input/output pairs for task X in style Y," verified by humans on a 20% sample
  • Manual annotation by domain experts for critical applications (law, medicine, finance)
  • Augmentation: paraphrases of the same questions improve model robustness to input variations

What to avoid in your data:

  • Duplicates and near-duplicates (for example, inputs whose text similarity exceeds roughly 90%): these lead to overfitting on specific phrases
  • Contradictory examples (same input, different output) — confuse the model during gradient steps
  • No train/validation split (90/10) — without a holdout set you can't detect overfitting in time
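
The three checks above can be sketched with the standard library alone. This is a rough illustration, not a production pipeline: the function names are hypothetical, the similarity threshold passed in the usage lines is an assumption to tune on your own data, and the pairwise comparison is quadratic, so large datasets would need a faster method.

```python
import random
from difflib import SequenceMatcher


def find_contradictions(pairs):
    """Return normalised inputs that appear with more than one distinct output."""
    outputs_by_input = {}
    for inp, out in pairs:
        outputs_by_input.setdefault(inp.strip().lower(), set()).add(out.strip())
    return sorted(k for k, outs in outputs_by_input.items() if len(outs) > 1)


def drop_near_duplicates(pairs, threshold):
    """Keep a pair only if its input is below `threshold` similarity to every kept input."""
    kept = []
    for inp, out in pairs:
        is_new = all(
            SequenceMatcher(None, inp.lower(), kept_inp.lower()).ratio() < threshold
            for kept_inp, _ in kept
        )
        if is_new:
            kept.append((inp, out))
    return kept


def train_val_split(pairs, val_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out `val_fraction` of the data for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical data illustrating each problem.
pairs = [
    ("How do I reset my password?", "Open Settings > Security."),
    ("How do I reset my password??", "Open Settings > Security."),  # near-duplicate
    ("What is your refund policy?", "Refunds within 30 days."),
    ("What is your refund policy?", "No refunds."),                 # contradiction
]

contradictions = find_contradictions(pairs)
deduped = drop_near_duplicates(pairs, threshold=0.9)  # illustrative threshold
train, val = train_val_split(list(range(10)))
```

Run the contradiction check before deduplication: once duplicates are dropped, a conflicting output for the same input is silently discarded rather than flagged for review.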

LoRA and QLoRA — fine-tuning on a single GPU

GPU memory requirements by training method

Minimum VRAM for a typical training run (batch size 2, sequence length 2,048 tokens):

| Model and method | Min. VRAM | Hardware | Practicality |
|---|---|---|---|
| 7B, full fine-tuning | 112 GB | A100 ×2 | Impractical for companies |
| 13B, full fine-tuning | 200 GB | A100 ×4 | Impractical for companies |
| 7B, LoRA (bf16) | 28 GB | A100 40 GB | Expensive GPU (~$2–3/h) |
| 13B, LoRA (bf16) | 52 GB | A100 80 GB | Expensive GPU (~$2–3/h) |
| 7B, QLoRA (4-bit) | 8 GB | RTX 4090 | Consumer GPU (~$0.50/h) |
| 13B, QLoRA (4-bit) | 12 GB | RTX 4090 | Consumer GPU (~$0.50/h) |
| 70B, QLoRA (4-bit) | 48 GB | A100 80 GB | Available, requires a data centre |

LoRA is a mathematical trick: instead of modifying weight matrix W (e.g. 4096×4096 ≈ 16.8M parameters), it adds two small matrices: A (4096×rank) and B (rank×4096). Only A and B are trained. At rank=16 the trainable parameter count drops from about 16.8M to about 131K (2 × 4096 × 16 = 131,072), which is 128× fewer, at quality close to full fine-tuning on narrow tasks.
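
The parameter arithmetic for a single adapted matrix can be verified in a few lines. The function name is hypothetical; the shapes follow the A (d_in×rank) and B (rank×d_out) layout described above.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora): parameters in W versus parameters in the A and B factors."""
    full = d_in * d_out                 # every entry of W
    lora = d_in * rank + rank * d_out   # A is d_in x rank, B is rank x d_out
    return full, lora


full, lora = lora_param_counts(4096, 4096, 16)
reduction = full / lora

print(full)       # 16777216
print(lora)       # 131072
print(reduction)  # 128.0
```

Doubling the rank doubles the adapter size, so the reduction factor for a square matrix is simply d / (2 × rank).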

QLoRA adds 4-bit quantisation of the base model (NF4 format from the bitsandbytes library). Effect: a 7B model that normally takes 14 GB in bf16 takes 4 GB quantised, plus ~0.5 GB for LoRA adapters. An RTX 4090 with 24 GB VRAM comfortably handles training with batch 2 and 2,048-token sequences.
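
The weight-memory figures follow from parameter count times bits per parameter. The sketch below is a lower bound under that assumption only: it ignores quantisation constants, activations, gradients and optimizer state, which is why real usage is higher than the raw numbers it prints.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(7e9, 16)  # 16 bits per weight
nf4_gb = weight_memory_gb(7e9, 4)    # 4 bits per weight

print(f"7B in bf16: {bf16_gb:.1f} GB")   # 14.0 GB
print(f"7B in 4-bit: {nf4_gb:.1f} GB")   # 3.5 GB
```

The gap between the 3.5 GB computed here and the total VRAM a training run needs is taken up by the adapters, their optimizer state and the activations for each batch.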

finetune_qlora.py
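# finetune_qlora.py: QLoRA with Unsloth (2-3x faster than standard HF Trainer)
# Import unsloth first so it can patch transformers/trl before they load.
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

MAX_SEQ_LEN = 2048
RANK = 16

# Load the base model already quantised to 4-bit (the "Q" in QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections; only these are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=RANK,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=RANK * 2,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing=True,
)

# Each JSONL line must provide the fields: system, input, output.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

TEMPLATE = "<|system|>\n{system}\n<|user|>\n{input}\n<|assistant|>\n{output}"

def format_chat(ex):
    # Append EOS so the model learns where a response ends.
    return {"text": TEMPLATE.format(**ex) + tokenizer.eos_token}

dataset = dataset.map(format_chat)

# Note: passing tokenizer, dataset_text_field and max_seq_length directly to
# SFTTrainer matches older TRL releases; newer ones move them into SFTConfig.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
        learning_rate=2e-4,
        # Use bf16 where the GPU supports it (e.g. RTX 4090), otherwise fp16.
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        save_steps=200,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        report_to="none",
    ),
)

trainer.train()

# Merge the adapters into the base weights and save the result in 16-bit.
model.save_pretrained_merged("./my-model", tokenizer, save_method="merged_16bit")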

Unsloth speeds up training 2–3× versus the standard HuggingFace Trainer and reduces VRAM usage by a further 50–60%. The finished model can be served with Ollama (local deployment and testing, typically via a GGUF export), vLLM (high-throughput production API) or Hugging Face Inference Endpoints.

Where to train and what does it cost?

| Platform | Base model | GPU | Cost (1,000 examples × 3 epochs) | Deployment |
|---|---|---|---|---|
| OpenAI FT API | GPT-4o-mini | Managed | $6–$15 | API immediately |
| HuggingFace + RunPod | Llama / Qwen | RTX 4090 (24 GB) | $1–$3 | Inference Endpoints |
| Lambda Labs | Llama / Qwen | A100 (80 GB) | $8–$15 | Self-deploy / vLLM |
| Google Vertex AI | Gemma / custom | Managed TPU | $20–$50 | Vertex AI Prediction |
| Modal.com | Llama / Mistral | A10G on-demand | $3–$8 | Serverless API |
| Unsloth + Colab Pro | Llama 3 up to 13B | T4 / A100 | $0 (GPU limit) | Export GGUF to Ollama |

How do you evaluate whether fine-tuning improved the model?

A subjective "it looks better" is not enough in a production environment. You need measurable metrics:

  • LLM-as-judge — GPT-4o rates 1–5 each response from the fine-tuned model vs the base model on a 100-example holdout set; cheap, scalable and highly correlated with human ratings
  • Task accuracy — for classification: accuracy and F1; for data extraction: precision and recall on key fields
  • Regression test — 50 examples outside the fine-tuning domain; verify the model hasn't lost general skills (catastrophic forgetting)
  • Latency and cost — does the shorter system prompt after fine-tuning actually reduce call time and cost by the planned 40–60%?
  • Hallucination rate — if the model outputs facts: what percentage of responses contain fabricated information before and after fine-tuning?
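
Two of the metrics above are easy to compute without extra libraries. The sketch below assumes you already have gold labels, model predictions and 1–5 judge scores for the same holdout examples; the function names and sample values are hypothetical.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Accuracy plus the unweighted mean of per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


def mean_judge_gain(base_scores, tuned_scores):
    """Average per-example difference in judge scores (fine-tuned minus base)."""
    if not base_scores or len(base_scores) != len(tuned_scores):
        raise ValueError("score lists must be non-empty and equally long")
    return sum(t - b for b, t in zip(base_scores, tuned_scores)) / len(base_scores)


# Hypothetical routing labels for four holdout tickets.
y_true = ["billing", "billing", "tech", "tech"]
y_pred = ["billing", "tech", "tech", "tech"]
accuracy, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)

# Hypothetical 1-5 judge scores for three holdout prompts.
gain = mean_judge_gain([3, 3, 4], [4, 5, 4])
```

Comparing scores per example, rather than comparing two averages, keeps the pairing between base and fine-tuned responses and makes regressions on individual prompts visible.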

Most common fine-tuning mistakes

  • Fine-tuning instead of a good prompt — 80% of style and format problems are solved by an optimal system prompt with 5 few-shot examples; check this before spending time and money on training
  • Too little data = overfitting — the model perfectly reproduces training data but fails on new inputs; symptom: validation loss rises while training loss falls
  • Too many epochs — 3–5 epochs is usually optimal for SFT; 10+ epochs leads to overfitting and degradation of general model skills
  • Training on unverified data — the model will learn errors and inaccuracies from training data; every example needs human or LLM-judge verification
  • No regression tests — fine-tuning on a narrow domain can distort responses outside that domain; always test general questions after training
  • Vendor lock-in of training data — training data is your strategic asset; store it in vendor-agnostic JSONL and train parallel open-source models as a backup
  • No post-deployment monitoring — a fine-tuned model can degrade as the input distribution shifts; monitor quality metrics regularly and plan retraining cycles
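
The overfitting symptom described above (validation loss rising while training loss keeps falling) can be detected automatically from logged losses. This is a minimal sketch with a hypothetical function name; the `patience` value, meaning how many evaluations without improvement to tolerate, is an assumption to adjust.

```python
def detect_overfitting(train_loss, val_loss, patience=2):
    """Return the index of the best checkpoint if overfitting is detected, else None.

    Overfitting is flagged when validation loss has not improved for `patience`
    evaluations while training loss is still below its value at the best point.
    """
    best_idx = 0
    for i in range(1, len(val_loss)):
        if val_loss[i] < val_loss[best_idx]:
            best_idx = i
        elif i - best_idx >= patience and train_loss[i] < train_loss[best_idx]:
            return best_idx
    return None


# Hypothetical loss curves, one value per evaluation step.
train_curve = [2.0, 1.5, 1.1, 0.8, 0.5]
overfit_val = [2.1, 1.7, 1.6, 1.8, 2.0]   # improves, then rises
healthy_val = [2.1, 1.7, 1.5, 1.4, 1.3]   # keeps improving

best_checkpoint = detect_overfitting(train_curve, overfit_val)
no_problem = detect_overfitting(train_curve, healthy_val)
```

The returned index points at the checkpoint worth keeping, which is usually earlier than the final one when too many epochs were run.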

Checklist before fine-tuning

  1. Check whether a better prompt or few-shot examples solves the problem; invest 2–4 hours before going further
  2. Check whether RAG is a better solution (if it is a factual knowledge problem rather than a style problem, use RAG instead)
  3. Collect at minimum 500 input/output pairs in JSONL format, with QA on each example
  4. Split the data: 90% training, 10% validation, plus 50–100 holdout examples outside the split
  5. Choose a platform based on budget and data privacy requirements (GDPR/HIPAA especially)
  6. Start with a 7B model; fine-tuning 7B is 3× cheaper than 13B at similar quality on narrow tasks
  7. After training, run an A/B test on the holdout set with LLM-as-judge: fine-tuned vs base model vs model with a better prompt
  8. Plan a retraining cycle every 3–6 months as the domain and input patterns evolve

---

I help companies decide whether fine-tuning is the right step — and if so, I run the entire process: from data auditing and preparation through QLoRA training to deployment and quality monitoring. Get in touch — I start with a free 30-minute analysis of your use case.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.
