LLM Evaluation Is Hard

ai · llm · evaluation · assessai

For AssessAI, I needed to evaluate candidates' problem-solving ability using an LLM. Not sentiment analysis or classification — actual judgment calls. "Did this candidate demonstrate strong systems thinking?" "Is their problem decomposition thorough or surface-level?"

Building this taught me that LLM evaluation is a fundamentally different problem from LLM generation, and most of the techniques that work for generation don't transfer.

why automated eval is necessary

A human evaluator takes 15-20 minutes to grade one candidate assessment. At scale — 500 candidates/month — that's 125-167 hours of evaluator time. At $50/hour for qualified evaluators, that's $6,250-$8,350/month just for grading.

GPT-4o evaluates a candidate response in ~5 seconds for ~$0.03. The economics aren't close. But the quality has to be close enough that the cost savings actually hold.
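
The arithmetic in code form, as a quick sketch (the per-evaluation LLM cost is the rough figure above and will drift with model pricing):

CANDIDATES_PER_MONTH = 500
HUMAN_MINUTES_PER_EVAL = (15, 20)
HUMAN_HOURLY_RATE = 50      # USD, qualified evaluator
LLM_COST_PER_EVAL = 0.03    # USD, rough GPT-4o figure

human_low, human_high = (
    CANDIDATES_PER_MONTH * minutes / 60 * HUMAN_HOURLY_RATE
    for minutes in HUMAN_MINUTES_PER_EVAL
)
llm_total = CANDIDATES_PER_MONTH * LLM_COST_PER_EVAL

print(f"Human grading: ${human_low:,.0f}-${human_high:,.0f}/month")  # $6,250-$8,333
print(f"LLM grading:   ${llm_total:,.2f}/month")                     # $15.00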

LLM-as-a-Judge: the naive version

The first version was embarrassingly simple:

eval_prompt = """Score this candidate's response on a scale of 1-10
for problem decomposition quality.
 
Candidate's response:
{response}
 
Score (1-10):"""
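
For completeness, this is roughly how that prompt was wired up (a minimal sketch assuming the official openai Python SDK; the model name and the bare int() parsing are illustrative assumptions, not the production code):

from openai import OpenAI

client = OpenAI()

def naive_evaluate(response: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduces, but does not eliminate, run-to-run variance
        messages=[{"role": "user",
                   "content": eval_prompt.format(response=response)}],
    )
    # Naive parsing: trust that the reply is a bare number
    return int(completion.choices[0].message.content.strip())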

This "worked" in the sense that it returned numbers between 1 and 10. It didn't work in any meaningful sense:

  • Positivity bias. Average score: 7.2 out of 10. The model almost never scored below 5, even for clearly poor responses. It's an agreeable system trained on human feedback — it doesn't want to be harsh.
  • Inconsistency. The same response scored 6 on one run and 8 on the next. Temperature 0 helped but didn't eliminate variance.
  • Length bias. Longer responses consistently scored higher, regardless of quality. A rambling, unfocused 500-word answer outscored a concise, correct 100-word answer.
  • Style over substance. Well-formatted responses with bullet points and headers scored higher than unstructured but insightful ones.

what made it work: rubric-based evaluation

The fix was giving the model a detailed rubric instead of a vague instruction:

eval_prompt = """Evaluate this candidate's problem decomposition using
the rubric below. For EACH criterion, provide a score and one-sentence
justification.
 
RUBRIC:
1. Identifies core sub-problems (0-3)
   0: No decomposition attempted
   1: Lists obvious components only
   2: Identifies non-obvious sub-problems
   3: Comprehensive decomposition with dependencies mapped
 
2. Prioritization (0-2)
   0: No prioritization
   1: Some ordering but rationale unclear
   2: Clear priority with explicit rationale
 
3. Edge cases acknowledged (0-2)
   0: No edge cases mentioned
   1: Obvious edge cases only
   2: Non-obvious edge cases identified
 
4. Feasibility awareness (0-3)
   0: No feasibility consideration
   1: Mentions constraints
   2: Identifies specific technical constraints
   3: Proposes solutions to constraints
 
Candidate's response:
{response}
 
Evaluate each criterion with score and justification:"""

This reduced the average score from 7.2 to 5.8 (more realistic distribution), cut variance by ~60%, and made the length bias mostly disappear. The rubric forces the model to evaluate against specific criteria instead of giving a gut-feel number.
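
The per-criterion output also has to be parsed back into numbers. One way to do that, as a sketch (the "name: score/max" line shape and the regex are my assumptions; asking the model to answer in JSON works just as well):

import re

# Assumes each criterion comes back on a line like:
#   "1. Identifies core sub-problems: 2/3 - lists obvious components only"
CRITERION_LINE = re.compile(
    r"^\s*\d+\.\s*(?P<name>[^:]+):\s*(?P<score>\d+)\s*/\s*\d+",
    re.MULTILINE,
)

def parse_rubric_scores(evaluation_text: str) -> dict:
    # Extract {criterion name: integer score} from the free-text evaluation
    return {
        m.group("name").strip(): int(m.group("score"))
        for m in CRITERION_LINE.finditer(evaluation_text)
    }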

the calibration problem

Even with rubrics, the model's scores didn't match those of human evaluators. Our calibration process:

  1. Collect human labels. 3 evaluators independently scored 200 candidate responses using the same rubric. Inter-rater agreement (average pairwise Cohen's kappa): 0.72 — "substantial agreement."

  2. Compare to LLM scores. Ran the same 200 responses through the LLM evaluator. Correlation with human average: 0.61 initially.

  3. Analyze disagreements. Where the LLM and humans diverged most:

    • Technical depth: LLM couldn't distinguish between superficially correct and deeply correct technical reasoning
    • Creativity: LLM undervalued novel approaches that didn't match common patterns
    • Conciseness: Humans valued brevity; LLM still had residual length bias

  4. Refine the rubric. Added specific examples for each score level. "A score of 3 on technical depth looks like [concrete example]. A score of 1 looks like [concrete example]." Correlation improved to 0.78.

  5. Add reference responses. Included 2 calibration examples in the prompt — one high-scoring and one low-scoring response with full rubric evaluations. This anchored the model's scoring. Final correlation: 0.83.

calibration_examples = """
REFERENCE RESPONSE A (Score: 9/10):
[full response text]
Evaluation: Decomposition 3/3 — identified 5 sub-problems including
non-obvious data consistency issue. Prioritization 2/2 — clear MoSCoW
ordering with rationale...
 
REFERENCE RESPONSE B (Score: 3/10):
[full response text]
Evaluation: Decomposition 1/3 — only listed the three obvious components
from the prompt, no original analysis...
"""

multi-judge reduces variance

A single judge run has variance. Three independent runs and a median of their scores reduce it dramatically.

import asyncio

async def multi_judge_evaluate(response: str, rubric: str) -> dict:
    # Run 3 independent evaluations in parallel; evaluate() is the
    # single-judge call and returns a dict of criterion -> score
    evals = await asyncio.gather(
        evaluate(response, rubric, temperature=0.3),
        evaluate(response, rubric, temperature=0.3),
        evaluate(response, rubric, temperature=0.3),
    )

    # Take the median score for each criterion
    final_scores = {}
    for criterion in evals[0]:
        scores = sorted(e[criterion] for e in evals)
        final_scores[criterion] = scores[1]  # median of 3

    return final_scores

3x the cost. Worth it. Variance dropped another 40% and human-LLM correlation reached 0.87. Good enough for automated screening. Not good enough to replace human review on borderline candidates.
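
The effect of the median is easy to sanity-check with a toy simulation (illustrative only; the symmetric noise model is an assumption, not measured judge behavior):

import random
import statistics

random.seed(0)
TRUE_SCORE = 6

def noisy_judge() -> int:
    # Toy judge: true score plus symmetric noise of up to 2 points
    return TRUE_SCORE + random.choice([-2, -1, 0, 1, 2])

singles = [noisy_judge() for _ in range(10_000)]
medians = [statistics.median(noisy_judge() for _ in range(3))
           for _ in range(10_000)]

print(statistics.pstdev(singles))  # ~1.41
print(statistics.pstdev(medians))  # noticeably lower, ~1.0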

where LLM evaluation still fails

Detecting genuine insight vs pattern matching. A candidate who recites the textbook answer and a candidate who derives the same answer from first principles get similar scores. Humans can tell the difference. LLMs mostly can't.

Evaluating creativity. Novel approaches that don't match training data get penalized. The model evaluates against patterns it's seen, not against the problem's actual requirements.

Detecting subtle BS. Candidates who write confidently about things they don't understand. The LLM rates their confidence as competence. Experienced human evaluators catch this in seconds.

Cross-cultural communication styles. Direct vs indirect communication, different approaches to uncertainty, varying levels of qualification hedging. The model's evaluation reflects Western communication norms.

the pragmatic approach

LLM evaluation works for:

  • First-pass screening (remove clearly unqualified candidates)
  • Structured rubric evaluation (with calibrated rubrics and reference examples)
  • Consistency checking (flagging outlier scores for human review)

LLM evaluation doesn't replace:

  • Final hiring decisions
  • Borderline case evaluation
  • Assessment of soft skills and cultural fit

My production setup: LLM evaluates all candidates automatically. Humans review the top 30% and any candidate where the LLM's confidence score is below 0.7. This cuts human evaluation time by ~65% while maintaining quality on the decisions that matter.
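
The routing rule is simple enough to state in code (a sketch; the Evaluated shape and the idea of deriving confidence from judge agreement are my assumptions, the thresholds are the ones above):

from dataclasses import dataclass

@dataclass
class Evaluated:
    name: str
    llm_score: float   # median rubric score from multi_judge_evaluate
    confidence: float  # 0-1, e.g. derived from agreement among the judges

def needs_human_review(c: Evaluated, cohort: list[Evaluated]) -> bool:
    # Humans always review the top 30% by LLM score...
    by_score = sorted(cohort, key=lambda x: x.llm_score, reverse=True)
    top_30 = by_score[: max(1, int(len(by_score) * 0.3))]
    if c in top_30:
        return True
    # ...plus anyone the evaluator itself is unsure about
    return c.confidence < 0.7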

Automated eval is a filter, not a judge. Treat it accordingly.

