Building AssessAI: Why Coding Tests Are Broken
8 weeks ago I decided to fix something that's been broken for years.
I watched three companies reject strong candidates because they couldn't reverse a linked list under pressure. These same people had shipped products to millions of users. One of them designed a system handling 50k concurrent connections. Failed the interview because she blanked on the optimal DP solution.
That's not a hiring process. That's a hazing ritual.
the problem is structural
It's 2026. AI writes production code faster than most humans. GitHub reports that Copilot writes 46% of the code in files where it's enabled. The skill that matters now isn't "implement Dijkstra's from memory" — it's "can you think clearly about products, systems, and tradeoffs?"
LeetCode tests none of that. It tests pattern matching and memorization under artificial time pressure. The signal-to-noise ratio has collapsed.
What actually matters for a senior engineer: Can they decompose ambiguous problems? Do they think about edge cases before writing code? Can they use AI tools effectively instead of fighting them? Do they understand the real tradeoffs — cost vs. latency vs. consistency vs. team velocity?
what assessai does
AssessAI scores candidates across 5 dimensions that actually predict job performance:
- Problem Decomposition — breaking ambiguous requirements into concrete steps
- System Thinking — reasoning about tradeoffs, constraints, failure modes
- AI Collaboration — this is the big one
- Communication — clarity of reasoning, not just correctness
- Technical Depth — architecture quality, not syntax recall
The flagship feature is LLM Interaction Mode. Candidates get an AI collaborator during the assessment. We don't just let them use it — we measure how they use it. Do they ask precise questions? Do they verify the AI's output or blindly paste it? Do they iterate on the design or accept the first answer?
This is the skill gap that matters now. Not whether you memorized BFS.
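To make "measure how they use it" concrete, here's the flavor of signal involved. This is an illustrative sketch, not the actual rubric; the transcript shape and heuristics are invented for the example.

// Hypothetical shape for one turn of the candidate <-> AI transcript.
type ChatTurn = {
  role: "candidate" | "assistant";
  text: string;
};

// Toy heuristics for AI-collaboration signals: question precision,
// whether the candidate verifies or pushes back, whether they iterate.
function collaborationSignals(turns: ChatTurn[]) {
  const candidateTurns = turns.filter((t) => t.role === "candidate");
  const questions = candidateTurns.filter((t) => t.text.includes("?"));
  const verification = candidateTurns.filter((t) =>
    /why|are you sure|doesn't that|edge case|what if/i.test(t.text)
  );
  return {
    questionCount: questions.length,
    verificationCount: verification.length,
    iterated: candidateTurns.length > 2, // did they go beyond one-shot prompting?
  };
}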
technical decisions that mattered
Next.js 15 with React 19. Supabase for everything backend — auth, database, real-time, edge functions. OpenAI for evaluation. Vercel AI SDK for streaming.
The stack was deliberate. Supabase meant I could move fast on auth and RLS without building a separate API layer. React 19's server components let me keep the bundle lean. But two decisions burned me:
OpenAI + complex Zod schemas = unreliable. I had structured output schemas with nested arrays of enums and optional sub-objects. The model would hallucinate field values or just skip required fields entirely. Took me two days to figure out the pattern:
import { z } from "zod";

// This breaks — too much nesting for reliable structured output
const complexSchema = z.object({
  analysis: z.object({
    dimensions: z.array(z.object({
      name: z.enum(["decomposition", "system", "ai"]),
      score: z.number().min(0).max(10),
      subScores: z.array(z.object({
        criteria: z.string(),
        value: z.number(),
      })).optional(),
    })),
  }),
});
// This works — flat, simple, all optional
const simpleSchema = z.object({
  decomposition_score: z.number().optional(),
  decomposition_reasoning: z.string().optional(),
  system_score: z.number().optional(),
  system_reasoning: z.string().optional(),
});

Keep structured output schemas shallow. One level deep, optional fields, no enums inside arrays. The model handles that reliably.
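For context, here's roughly how a flat schema like that plugs into the Vercel AI SDK's generateObject call. The prompt and wiring below are placeholders, not the real evaluation prompt.

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";

// simpleSchema is the flat schema defined above.
// transcript would come from the candidate session; placeholder here.
const transcript = "…candidate transcript…";

const { object } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: simpleSchema,
  prompt: `Score this transcript on decomposition and system thinking:\n${transcript}`,
});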
React 19 form actions broke Chrome-based testing. Programmatic input doesn't trigger React 19's new form action handlers the way it does with traditional onChange. I burned a full day on this before switching to API-level testing for form flows.
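API-level testing here just means hitting the endpoint the form ultimately writes through instead of simulating keystrokes. A rough sketch with Vitest; the /api/responses route and payload shape are invented for the example.

import { describe, expect, it } from "vitest";

// Instead of driving the React 19 form through the browser, hit the
// handler the form submission ultimately calls.
describe("candidate response submission", () => {
  it("persists a response for the session", async () => {
    const res = await fetch("http://localhost:3000/api/responses", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        sessionId: "test-session",
        questionId: "q-42",
        response: "My answer",
      }),
    });
    expect(res.status).toBe(200);
  });
});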
things that went wrong
The caching bug. Three days. React Query's staleTime was set to 30 seconds, but the candidate workspace had multiple components reading the same query key with different params. Stale data was rendering in the wrong workspace tab. The fix was straightforward once I found it — normalize query keys per question — but finding it meant tracing state through six layers of component hierarchy.
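The normalization is the boring part, which is the point: one factory owns query-key construction so params can't drift between components. Roughly (names illustrative, not the actual keys):

// One place that owns query-key construction, so every workspace tab
// reads and writes the same key for a given question.
export const responseKeys = {
  all: ["responses"] as const,
  session: (sessionId: string) => [...responseKeys.all, sessionId] as const,
  question: (sessionId: string, questionId: string) =>
    [...responseKeys.session(sessionId), questionId] as const,
};

// Usage in any workspace component:
// useQuery({ queryKey: responseKeys.question(sessionId, questionId), queryFn: ... })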
Concurrent auto-saves from the candidate workspace would race. I shipped delete-then-insert logic initially. Two saves hitting the server 200ms apart meant the second delete would wipe the first insert. The fix:
INSERT INTO candidate_responses (session_id, question_id, response)
VALUES ($1, $2, $3)
ON CONFLICT (session_id, question_id)
DO UPDATE SET response = EXCLUDED.response, updated_at = now();

Upsert everything. Always.
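On the app side, the same thing through supabase-js looks roughly like this, assuming the unique constraint on (session_id, question_id) exists:

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

async function saveResponse(sessionId: string, questionId: string, response: string) {
  // onConflict matches the unique constraint on (session_id, question_id),
  // so concurrent saves update in place instead of racing a delete.
  const { error } = await supabase
    .from("candidate_responses")
    .upsert(
      { session_id: sessionId, question_id: questionId, response },
      { onConflict: "session_id,question_id" }
    );
  if (error) throw error;
}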
The AI ghostwriting bug was subtle. The extraction service was pulling deliverables from all chat messages — including the AI's own responses. So the AI was effectively writing the candidate's deliverable for them. Had to filter to only candidate messages during extraction.
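The fix is a one-line filter before extraction, sketched here with an illustrative message shape:

type Message = { role: "candidate" | "assistant"; content: string };

// Only candidate-authored text should feed deliverable extraction;
// otherwise the AI's own suggestions get scored as the candidate's work.
function extractionInput(messages: Message[]): string {
  return messages
    .filter((m) => m.role === "candidate")
    .map((m) => m.content)
    .join("\n\n");
}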
things that went right
630+ unit tests. I wrote them first (TDD, enforced by Claude Code's tdd-guide agent). This caught 30+ bugs in a single marathon session. The test suite is the reason I could ship fast without breaking things.
Small models are fast enough. gpt-4o-mini handles 90% of the evaluation pipeline. The full gpt-4o model only runs for final comprehensive scoring. This cut API costs by roughly 8x while keeping evaluation quality high.
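The routing itself is deliberately boring: pick the model per pipeline stage. Roughly (stage names invented for the example):

import { openai } from "@ai-sdk/openai";

// Cheap model for per-question evaluation, full model only for the
// final comprehensive score.
function modelFor(stage: "per_question" | "final_report") {
  return stage === "final_report" ? openai("gpt-4o") : openai("gpt-4o-mini");
}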
The anti-cheating system uses a 3-tier escalation: gentle warning on the first tab switch, firm warning on the second, auto-flag on the third. The counter lives in a useRef instead of useState, because state values read inside the event listener get captured in stale closures under React 19's concurrent rendering.
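A stripped-down version of the tab-switch guard, as a sketch rather than the production hook; the warning UI is left to the caller:

import { useEffect, useRef } from "react";

// Count tab switches in a ref so the visibilitychange listener always
// sees the current count instead of a stale closed-over state value.
function useTabSwitchGuard(onEscalate: (level: 1 | 2 | 3) => void) {
  const switches = useRef(0);

  useEffect(() => {
    const handler = () => {
      if (document.visibilityState !== "hidden") return;
      switches.current += 1;
      if (switches.current === 1) onEscalate(1); // gentle warning
      else if (switches.current === 2) onEscalate(2); // firm warning
      else if (switches.current === 3) onEscalate(3); // auto-flag
    };
    document.addEventListener("visibilitychange", handler);
    return () => document.removeEventListener("visibilitychange", handler);
  }, [onEscalate]);
}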
where it stands
275 questions across 20 categories. 291 test scenarios with Chrome-based verification. 11 database migrations. The build passes with zero TypeScript errors.
The real product insight isn't the tech. It's that assessment tools should evaluate how engineers work with AI, not how they work without it. The industry hasn't caught up to that yet. AssessAI is my bet that it will.