LLM Evaluation Is Hard
For AssessAI, I needed to evaluate candidates' problem-solving ability using an LLM. Not sentiment analysis or classification — actual judgment calls. "Did this candidate demonstrate strong systems thinking?" "Is their problem decomposition thorough or surface-level?"
Building this taught me that LLM evaluation is a fundamentally different problem than LLM generation, and most of the techniques that work for generation don't transfer.
why automated eval is necessary
A human evaluator takes 15-20 minutes to grade one candidate assessment. At scale — 500 candidates/month — that's 125-167 hours of evaluator time. At $50/hour for qualified evaluators, that's $6,250-$8,350/month just for grading.
GPT-4o evaluates a candidate response in ~5 seconds for ~$0.03. The economics aren't close. But the quality has to be close enough that the cost savings actually hold.
LLM-as-a-Judge: the naive version
The first version was embarrassingly simple:
```python
eval_prompt = """Score this candidate's response on a scale of 1-10
for problem decomposition quality.

Candidate's response:
{response}

Score (1-10):"""
```

This "worked" in the sense that it returned numbers between 1 and 10. It didn't work in any meaningful sense:
- Positivity bias. Average score: 7.2 out of 10. The model almost never scored below 5, even for clearly poor responses. It's an agreeable system trained on human feedback — it doesn't want to be harsh.
- Inconsistency. The same response scored 6 on one run and 8 on the next. Temperature 0 helped but didn't eliminate variance.
- Length bias. Longer responses consistently scored higher, regardless of quality. A rambling, unfocused 500-word answer outscored a concise, correct 100-word answer.
- Style over substance. Well-formatted responses with bullet points and headers scored higher than unstructured but insightful ones.
what made it work: rubric-based evaluation
The fix was giving the model a detailed rubric instead of a vague instruction:
```python
eval_prompt = """Evaluate this candidate's problem decomposition using
the rubric below. For EACH criterion, provide a score and one-sentence
justification.

RUBRIC:
1. Identifies core sub-problems (0-3)
   0: No decomposition attempted
   1: Lists obvious components only
   2: Identifies non-obvious sub-problems
   3: Comprehensive decomposition with dependencies mapped

2. Prioritization (0-2)
   0: No prioritization
   1: Some ordering but rationale unclear
   2: Clear priority with explicit rationale

3. Edge cases acknowledged (0-2)
   0: No edge cases mentioned
   1: Obvious edge cases only
   2: Non-obvious edge cases identified

4. Feasibility awareness (0-3)
   0: No feasibility consideration
   1: Mentions constraints
   2: Identifies specific technical constraints
   3: Proposes solutions to constraints

Candidate's response:
{response}

Evaluate each criterion with score and justification:"""
```

This reduced the average score from 7.2 to 5.8 (a more realistic distribution), cut variance by ~60%, and made the length bias mostly disappear. The rubric forces the model to evaluate against specific criteria instead of giving a gut-feel number.
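Turning the judge's free-text rubric evaluation back into per-criterion numbers is fiddly; in practice it helps to also tell the model to emit one fixed-format line per criterion. A parsing sketch, assuming lines like `Criterion name: score/max - justification` — the line format is my assumption, not the exact production format:

```python
import re

# Criterion names and point caps, taken from the rubric above.
CRITERIA_MAX = {
    "Identifies core sub-problems": 3,
    "Prioritization": 2,
    "Edge cases acknowledged": 2,
    "Feasibility awareness": 3,
}

def parse_rubric_eval(text: str) -> dict[str, int]:
    """Parse 'Criterion: score/max - justification' lines into {criterion: score}."""
    scores = {}
    for name, max_pts in CRITERIA_MAX.items():
        m = re.search(rf"{re.escape(name)}\s*:?\s*(\d+)\s*/\s*{max_pts}", text)
        if m:
            scores[name] = min(int(m.group(1)), max_pts)  # clamp off-scale scores
    return scores

reply = """Identifies core sub-problems: 2/3 - found non-obvious sub-problems.
Prioritization: 1/2 - ordering present but rationale thin.
Edge cases acknowledged: 2/2 - caught a non-obvious race condition.
Feasibility awareness: 1/3 - mentions constraints only."""
```

Anchoring the regex on both the criterion name and its known maximum makes the parser skip hallucinated criteria instead of silently miscounting them.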
the calibration problem
Even with rubrics, the model's scores didn't match human evaluators. Our calibration process:
1. Collect human labels. 3 evaluators independently scored 200 candidate responses using the same rubric. Inter-rater agreement (Cohen's kappa): 0.72 — "substantial agreement."
2. Compare to LLM scores. Ran the same 200 responses through the LLM evaluator. Correlation with human average: 0.61 initially.
3. Analyze disagreements. Where the LLM and humans diverged most:
   - Technical depth: LLM couldn't distinguish between superficially correct and deeply correct technical reasoning
   - Creativity: LLM undervalued novel approaches that didn't match common patterns
   - Conciseness: Humans valued brevity; LLM still had residual length bias
4. Refine the rubric. Added specific examples for each score level. "A score of 3 on technical depth looks like [concrete example]. A score of 1 looks like [concrete example]." Correlation improved to 0.78.
5. Add reference responses. Included 2 calibration examples in the prompt — one high-scoring and one low-scoring response with full rubric evaluations. This anchored the model's scoring. Final correlation: 0.83.
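The correlation figures in these steps are simple score-vs-score correlations. A self-contained sketch, assuming plain Pearson correlation (the post doesn't say which coefficient was used, and the toy score lists are invented for illustration):

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human_avg = [7.0, 4.3, 8.7, 5.0, 6.3]  # mean of the 3 human scores per response
llm_score = [6.0, 5.0, 9.0, 5.0, 7.0]  # the LLM judge's scores, same responses
```

Tracking this number across rubric revisions is what tells you whether a prompt change actually moved the evaluator toward human judgment or just shuffled scores around.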
```python
calibration_examples = """
REFERENCE RESPONSE A (Score: 9/10):
[full response text]

Evaluation: Decomposition 3/3 — identified 5 sub-problems including
non-obvious data consistency issue. Prioritization 2/2 — clear MoSCoW
ordering with rationale...

REFERENCE RESPONSE B (Score: 3/10):
[full response text]

Evaluation: Decomposition 1/3 — only listed the three obvious components
from the prompt, no original analysis...
"""
```

multi-judge reduces variance
A single judge run has variance. Three independent runs and a per-criterion median reduce it dramatically.
```python
import asyncio

async def multi_judge_evaluate(response: str, rubric: str) -> dict:
    # Run 3 independent evaluations of the same response
    evals = await asyncio.gather(
        evaluate(response, rubric, temperature=0.3),
        evaluate(response, rubric, temperature=0.3),
        evaluate(response, rubric, temperature=0.3),
    )
    # Take the median score for each rubric criterion
    final_scores = {}
    for criterion in evals[0]:
        scores = [e[criterion] for e in evals]
        final_scores[criterion] = sorted(scores)[1]  # median of 3
    return final_scores
```

3x the cost. Worth it. Variance dropped another 40% and human-LLM correlation reached 0.87. Good enough for automated screening. Not good enough to replace human review on borderline candidates.
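The median step is easy to test in isolation, and the three runs give you a disagreement signal for free. A sketch, assuming each judge run returns a dict of criterion scores; the spread threshold of 2 points is my invention, not a production value:

```python
def median_scores(evals: list[dict[str, int]], spread_flag: int = 2) -> tuple[dict[str, int], bool]:
    """Median per criterion across 3 judge runs; flag if any criterion's spread is large."""
    final, needs_review = {}, False
    for criterion in evals[0]:
        scores = sorted(e[criterion] for e in evals)
        final[criterion] = scores[1]  # median of 3
        if scores[-1] - scores[0] >= spread_flag:
            needs_review = True  # judges disagreed badly; route to a human
    return final, needs_review

runs = [
    {"decomposition": 2, "prioritization": 1},
    {"decomposition": 3, "prioritization": 1},
    {"decomposition": 2, "prioritization": 2},
]
```

The flag matters because a tight median (6, 6, 7) and a noisy one (3, 6, 9) produce the same final score but deserve very different levels of trust.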
where LLM evaluation still fails
Detecting genuine insight vs pattern matching. A candidate who recites the textbook answer and a candidate who derives the same answer from first principles get similar scores. Humans can tell the difference. LLMs mostly can't.
Evaluating creativity. Novel approaches that don't match training data get penalized. The model evaluates against patterns it's seen, not against the problem's actual requirements.
Detecting subtle BS. Candidates who write confidently about things they don't understand. The LLM rates their confidence as competence. Experienced human evaluators catch this in seconds.
Cross-cultural communication styles. Direct vs indirect communication, different approaches to uncertainty, varying levels of qualification hedging. The model's evaluation reflects Western communication norms.
the pragmatic approach
LLM evaluation works for:
- First-pass screening (remove clearly unqualified candidates)
- Structured rubric evaluation (with calibrated rubrics and reference examples)
- Consistency checking (flagging outlier scores for human review)
LLM evaluation doesn't replace:
- Final hiring decisions
- Borderline case evaluation
- Assessment of soft skills and cultural fit
My production setup: LLM evaluates all candidates automatically. Humans review the top 30% and any candidate where the LLM's confidence score is below 0.7. This cuts human evaluation time by ~65% while maintaining quality on the decisions that matter.
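That routing rule is simple enough to state as code. A sketch — the 0.7 confidence threshold and top-30% cutoff come from the text, but the function and parameter names are my own:

```python
def needs_human_review(rank_percentile: float, llm_confidence: float) -> bool:
    """Route a candidate to human review: top 30% by LLM rank, or low judge confidence."""
    top_30_percent = rank_percentile >= 0.70  # 0.70 percentile and up = top 30%
    low_confidence = llm_confidence < 0.7     # judge itself is unsure
    return top_30_percent or low_confidence
```

Keeping the two conditions separate is deliberate: the first catches candidates who matter, the second catches evaluations that can't be trusted.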
Automated eval is a filter, not a judge. Treat it accordingly.