Fine-Tuning Is Usually Wrong
"We should fine-tune a model for this."
I've heard this in every AI project planning meeting. It sounds right — your use case is specific, your data is unique, therefore you need a custom model. But 80% of the time, fine-tuning is the wrong first move.
the decision framework
I use a simple flowchart:
Step 1: Can prompt engineering solve it? Write a good system prompt with clear instructions, a few examples, and structured output constraints. Test on 50 representative inputs (I sketch a minimal test harness after the steps). If accuracy is >90%, stop here. You're done.
Step 2: Can RAG solve the knowledge gap? If the model needs domain-specific information it doesn't have — product catalogs, internal docs, company policies — RAG is almost always better than fine-tuning. Fine-tuning bakes knowledge into weights. RAG keeps it in a retrievable, updatable store.
Step 3: Is the issue behavior or knowledge? This is the key distinction.
- Knowledge problem: The model doesn't know your product specs, pricing, or policies. → RAG
- Behavior problem: The model knows the facts but doesn't respond in the right style, format, or reasoning pattern. → Fine-tuning (maybe)
Step 4: Do you have enough high-quality training data? Fine-tuning needs at minimum 100 examples for basic style transfer. For meaningful behavior change, 500-1000+. For complex reasoning patterns, several thousand. If you don't have this data — and you rarely do at project start — fine-tuning isn't an option yet anyway.
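For concreteness, here's roughly what that Step 1 test can look like. This is a minimal sketch assuming the OpenAI Python SDK; the eval_cases list, the exact-match grading, and candidate_prompt are placeholders for your own data and logic.

# Minimal Step 1 harness: run the candidate prompt over labeled inputs
# and measure accuracy. The exact-match grading and eval_cases are
# placeholders for your own data and grading logic.
from openai import OpenAI

client = OpenAI()

candidate_prompt = "You are a support triage agent. Reply with one word: billing, technical, or sales."

eval_cases = [
    {"input": "I was charged twice this month.", "expected": "billing"},
    # ... ~50 representative inputs, ideally pulled from real traffic
]

def run_eval(system_prompt: str) -> float:
    correct = 0
    for case in eval_cases:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": case["input"]},
            ],
        )
        output = response.choices[0].message.content.strip().lower()
        correct += output == case["expected"]
    return correct / len(eval_cases)

print(f"Accuracy: {run_eval(candidate_prompt):.0%}")  # >90%? Stop. You're done.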
why prompt engineering wins more than you'd think
Modern models are remarkably steerable with just a system prompt. Here's what a well-structured prompt can do without any fine-tuning:
system_prompt = """You are a customer support agent for Acme Corp.
RESPONSE RULES:
- Maximum 3 sentences per response
- Always acknowledge the customer's frustration before solving
- Use product names exactly as listed (never abbreviate)
- If you don't know the answer, say "Let me connect you with a specialist"
- Never discuss competitor products
- Never offer discounts unless the customer explicitly asks
TONE: Professional but warm. No corporate jargon. Write like a helpful
human, not a robot.
PRODUCT CATALOG:
- Acme Pro ($49/mo): 100 users, 50GB storage, priority support
- Acme Teams ($99/mo): unlimited users, 500GB, dedicated CSM
- Acme Enterprise: custom pricing, contact sales"""
# This handles 90%+ of support conversations without fine-tuning.

The structured prompt with explicit rules beats a fine-tuned model that learned the rules implicitly from examples. Why? Because you can read and edit the rules. When the product changes, you update the prompt in 30 seconds. With a fine-tuned model, you retrain.
when fine-tuning is actually right
There are legitimate use cases. Here are the ones I've seen work:
Consistent style at scale. A company needed every customer communication — emails, chat, docs — to sound like it came from the same voice. They had 5,000 examples of approved communications. Fine-tuning gave them style consistency that prompting couldn't match.
Latency-sensitive classification. A fine-tuned GPT-4o-mini classifies support tickets in ~100ms. The base model with a detailed prompt takes ~400ms. At 50,000 tickets/day, that 300ms matters. Fine-tuning also reduced the prompt from 800 tokens to 50, cutting costs by ~90%.
# Before: 800-token prompt with examples and rules
# After fine-tuning: 50-token prompt
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:acme:ticket-classifier:abc123",
    messages=[
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": ticket_text},
    ],
)
# Same accuracy, 4x faster, 90% cheaper

Domain-specific reasoning. Legal analysis, medical coding, financial regulation — domains where the base model's general knowledge isn't enough and RAG alone can't teach the reasoning pattern. You need both: RAG for facts, fine-tuning for domain-specific inference.
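For that last case, the system shape is worth sketching. What follows is an illustrative combination, not a prescription: retrieve is a stand-in for whatever vector store you query, and the fine-tuned model ID is made up.

# Illustrative sketch, not a prescription: RAG supplies the facts,
# a fine-tuned model applies the domain reasoning pattern to them.
from openai import AsyncOpenAI

client = AsyncOpenAI()

def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Stand-in for your vector store query (pgvector, Pinecone, etc.)."""
    raise NotImplementedError

async def analyze(question: str) -> str:
    # RAG step: fetch the relevant regulations, specs, or policies.
    context = "\n\n".join(retrieve(question))

    # Fine-tuning step: a model trained on your domain's reasoning pattern.
    # The model ID below is hypothetical.
    response = await client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:acme:legal-analysis:xyz789",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content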
the hidden costs nobody mentions
Fine-tuning isn't just training cost. It's:
- Data curation: 40+ hours to clean, label, and validate 1,000 training examples (the format sketch after this list shows what each one looks like)
- Iteration cycles: Your first fine-tune won't be good enough. Budget 3-5 rounds.
- Evaluation infrastructure: You need automated evals to compare base vs fine-tuned
- Maintenance: When the base model updates, your fine-tune might need retraining
- Vendor lock-in: A fine-tuned GPT-4o-mini doesn't transfer to Claude. Your prompt does.
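For a sense of what "curated" means concretely, here's the shape of one training example in OpenAI's chat fine-tuning format, one JSON object per line of a JSONL file. The ticket content here is invented; the structure is the part that matters.

# One curated training example: a conversation ending with the assistant
# turn you want the model to imitate. Multiply this by 1,000, each one
# cleaned, labeled, and validated.
import json

example = {
    "messages": [
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")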
I've seen teams spend 6 weeks on a fine-tuning pipeline that a 2-hour prompt engineering session would have solved. The sunk cost fallacy kicks in fast — "we've already labeled 2,000 examples, we can't switch to RAG now."
my recommendation
Start with the cheapest intervention and escalate:
- Prompt engineering (hours, $0) — solves 60% of cases
- RAG (days, $100-500/mo for vector DB) — solves another 25%
- Prompt + RAG (days, same cost) — solves another 10%
- Fine-tuning (weeks, $500-5000 per training run) — the remaining 5%
If you're reaching for fine-tuning first, you're optimizing the wrong thing. Invest that time in better prompts, better retrieval, and better evaluation instead. When those genuinely plateau, then fine-tune.