Fine-Tuning Is Usually Wrong
"We should fine-tune a model for this."
I've heard this in every AI project planning meeting. It sounds right — your use case is specific, your data is unique, therefore you need a custom model. But 80% of the time, fine-tuning is the wrong first move.
the decision framework
I use a simple flowchart:
Step 1: Can prompt engineering solve it? Write a good system prompt with clear instructions, a few examples, and structured output constraints. Test on 50 representative inputs (I sketch a minimal test harness after the steps). If accuracy is >90%, stop here. You're done.
Step 2: Can RAG solve the knowledge gap? If the model needs domain-specific information it doesn't have — product catalogs, internal docs, company policies — RAG is almost always better than fine-tuning. Fine-tuning bakes knowledge into weights. RAG keeps it in a retrievable, updatable store.
Step 3: Is the issue behavior or knowledge? This is the key distinction.
- Knowledge problem: The model doesn't know your product specs, pricing, or policies. → RAG
- Behavior problem: The model knows the facts but doesn't respond in the right style, format, or reasoning pattern. → Fine-tuning (maybe)
Step 4: Do you have enough high-quality training data? Fine-tuning needs at minimum 100 examples for basic style transfer. For meaningful behavior change, 500-1000+. For complex reasoning patterns, several thousand. If you don't have this data — and you rarely do at project start — fine-tuning isn't an option yet anyway.
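For concreteness, here's roughly what that Step 1 test can look like. This is a minimal sketch assuming the OpenAI Python SDK; the eval_cases list, the exact-match grading, and candidate_prompt are placeholders for your own data and logic.

# Minimal Step 1 harness: run the candidate prompt over labeled inputs
# and measure accuracy. The exact-match grading and eval_cases are
# placeholders for your own data and grading logic.
from openai import OpenAI

client = OpenAI()

candidate_prompt = "You are a support triage agent. Reply with one word: billing, technical, or sales."

eval_cases = [
    {"input": "I was charged twice this month.", "expected": "billing"},
    # ... ~50 representative inputs, ideally pulled from real traffic
]

def run_eval(system_prompt: str) -> float:
    correct = 0
    for case in eval_cases:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": case["input"]},
            ],
        )
        output = response.choices[0].message.content.strip().lower()
        correct += output == case["expected"]
    return correct / len(eval_cases)

print(f"Accuracy: {run_eval(candidate_prompt):.0%}")  # >90%? Stop. You're done.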
why prompt engineering wins more than you'd think
Modern models are remarkably steerable with just a system prompt. Here's what a well-structured prompt can do without any fine-tuning:
system_prompt = """You are a customer support agent for Acme Corp.
RESPONSE RULES:
- Maximum 3 sentences per response
- Always acknowledge the customer's frustration before solving
- Use product names exactly as listed (never abbreviate)
- If you don't know the answer, say "Let me connect you with a specialist"
- Never discuss competitor products
- Never offer discounts unless the customer explicitly asks
TONE: Professional but warm. No corporate jargon. Write like a helpful
human, not a robot.
PRODUCT CATALOG:
- Acme Pro ($49/mo): 100 users, 50GB storage, priority support
- Acme Teams ($99/mo): unlimited users, 500GB, dedicated CSM
- Acme Enterprise: custom pricing, contact sales"""
# This handles 90%+ of support conversations without fine-tuning.

The structured prompt with explicit rules beats a fine-tuned model that learned the rules implicitly from examples. Why? Because you can read and edit the rules. When the product changes, you update the prompt in 30 seconds. With a fine-tuned model, you retrain.
when fine-tuning is actually right
There are legitimate use cases. Here are the ones I've seen work:
Consistent style at scale. A company needed every customer communication — emails, chat, docs — to sound like it came from the same voice. They had 5,000 examples of approved communications. Fine-tuning gave them style consistency that prompting couldn't match.
Latency-sensitive classification. A fine-tuned GPT-4o-mini classifies support tickets in ~100ms. The base model with a detailed prompt takes ~400ms. At 50,000 tickets/day, that 300ms matters. Fine-tuning also reduced the prompt from 800 tokens to 50, cutting costs by ~90%.
# Before: 800-token prompt with examples and rules
# After fine-tuning: 50-token prompt
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:acme:ticket-classifier:abc123",
    messages=[
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": ticket_text},
    ],
)
# Same accuracy, 4x faster, 90% cheaper

Domain-specific reasoning. Legal analysis, medical coding, financial regulation — domains where the base model's general knowledge isn't enough and RAG alone can't teach the reasoning pattern. You need both: RAG for facts, fine-tuning for domain-specific inference.
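For that last case, the system shape is worth sketching. What follows is an illustrative combination, not a prescription: retrieve is a stand-in for whatever vector store you query, and the fine-tuned model ID is made up.

# Illustrative sketch, not a prescription: RAG supplies the facts,
# a fine-tuned model applies the domain reasoning pattern to them.
from openai import AsyncOpenAI

client = AsyncOpenAI()

def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Stand-in for your vector store query (pgvector, Pinecone, etc.)."""
    raise NotImplementedError

async def analyze(question: str) -> str:
    # RAG step: fetch the relevant regulations, specs, or policies.
    context = "\n\n".join(retrieve(question))

    # Fine-tuning step: a model trained on your domain's reasoning pattern.
    # The model ID below is hypothetical.
    response = await client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:acme:legal-analysis:xyz789",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content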
the hidden costs nobody mentions
Fine-tuning isn't just training cost. It's:
- Data curation: 40+ hours to clean, label, and validate 1,000 training examples (the format sketch after this list shows what each one looks like)
- Iteration cycles: Your first fine-tune won't be good enough. Budget 3-5 rounds.
- Evaluation infrastructure: You need automated evals to compare base vs fine-tuned
- Maintenance: When the base model updates, your fine-tune might need retraining
- Vendor lock-in: A fine-tuned GPT-4o-mini doesn't transfer to Claude. Your prompt does.
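For a sense of what "curated" means concretely, here's the shape of one training example in OpenAI's chat fine-tuning format, one JSON object per line of a JSONL file. The ticket content here is invented; the structure is the part that matters.

# One curated training example: a conversation ending with the assistant
# turn you want the model to imitate. Multiply this by 1,000, each one
# cleaned, labeled, and validated.
import json

example = {
    "messages": [
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")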
I've seen teams spend 6 weeks on a fine-tuning pipeline that a 2-hour prompt engineering session would have solved. The sunk cost fallacy kicks in fast — "we've already labeled 2,000 examples, we can't switch to RAG now."
my recommendation
Start with the cheapest intervention and escalate:
- Prompt engineering (hours, $0) — solves 60% of cases
- RAG (days, $100-500/mo for vector DB) — solves another 25%
- Prompt + RAG (days, same cost) — solves another 10%
- Fine-tuning (weeks, $500-5000 per training run) — the remaining 5%
If you're reaching for fine-tuning first, you're optimizing the wrong thing. Invest that time in better prompts, better retrieval, and better evaluation instead. When those genuinely plateau, then fine-tune.