feat(judge): graded partial-credit scoring with intent-based judging (evals v2) by kwikiel · Pull Request #380 · callstackincubator/evals

kwikiel · 2026-06-03T11:14:50Z

Summary

The v1 judge scored each requirement as binary pass/fail and graded largely on literal API matching. In practice this rejects functionally correct, idiomatic React Native code when it reaches a requirement's goal via a valid alternative API. Concrete examples found while auditing a real run:

A login flow that prevents back-navigation with navigation.reset(...) was failed because the requirement named the declarative if-guard pattern — even though the behavior the prompt asked for is satisfied.
A velocity-aware snap animation using withSpring was failed because the requirement named withTiming — even though it is correct, idiomatic, and arguably a better fit for the prompt's "release velocity" wording.

This PR introduces methodology v2: graded partial credit + intent-based judging, so the score measures idiomatic, working code rather than conformance to one specific API among several valid ones.

What changed

Graded partial credit — each requirement is scored 0..1 (1/0.75/0.5/0.25/0); passed is derived (score >= 0.5). v1 binary is the special case score ∈ {0,1}.
Intent-based judging — the judge grades each requirement on its underlying goal and credits valid idiomatic alternatives; low scores are reserved for unmet goals, broken code, or explicit evidence-gated prohibitions.
Code-quality dimension — per-eval codeQuality score + run-level averageCodeQuality, measuring production ("Callstack") craft independent of literal matching.
File-path-aware prompts — submitted files are tagged with their path (previously the judge saw unlabeled file contents, which weakened path/import-scoped criteria).
Version marker — per-eval results and summary carry methodologyVersion: 2.
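As a rough illustration of the file-path tagging described above, the sketch below labels each submitted file with its path before it is placed in the judge prompt. The `SubmittedFile` type, the `renderFilesForJudge` name, and the `--- file: ... ---` delimiter are all hypothetical; the actual prompt format used by the runner is not shown in this PR description.

```typescript
// Hypothetical shape of a submitted file; not taken from the runner source.
interface SubmittedFile {
  path: string;
  content: string;
}

// Prefix each file's contents with a path header so that path- or
// import-scoped criteria can be checked against a named file.
// The delimiter format here is an assumption chosen for illustration.
function renderFilesForJudge(files: SubmittedFile[]): string {
  return files
    .map((f) => `--- file: ${f.path} ---\n${f.content.trimEnd()}`)
    .join("\n\n");
}
```

With a header like this, a criterion such as "the list component is imported in the screen file" can be evaluated per file rather than against one unlabeled blob.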

Backward compatible: judge rows that omit score fall back to the binary passed flag.
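The scoring rules above can be sketched as follows. This is a minimal illustration, not the runner's code: the `JudgeRow` shape, the `weight` field name, and the function names are assumptions. Only the behavior comes from this PR: a 0..1 score, `passed` derived as `score >= 0.5`, fallback to the binary `passed` flag when `score` is missing, clamping, and a weighted mean for `scoreRatio`.

```typescript
// Hypothetical judge row; only `score` and `passed` are named in the PR.
interface JudgeRow {
  weight: number;   // assumed per-requirement weight
  score?: number;   // graded credit, expected in [0, 1]
  passed?: boolean; // binary v1 flag
}

// Graded credit for one requirement.
// Rows without a usable score fall back to the binary flag (v1 behavior).
function creditFor(row: JudgeRow): number {
  if (typeof row.score === "number" && Number.isFinite(row.score)) {
    // Clamp out-of-range judge output into [0, 1].
    return Math.min(1, Math.max(0, row.score));
  }
  return row.passed === true ? 1 : 0;
}

// `passed` is derived from the graded credit.
function isPassed(row: JudgeRow): boolean {
  return creditFor(row) >= 0.5;
}

// Weighted mean of per-requirement credit; 0 when there is no weight at all.
function scoreRatio(rows: JudgeRow[]): number {
  const totalWeight = rows.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const earned = rows.reduce((sum, r) => sum + r.weight * creditFor(r), 0);
  return earned / totalWeight;
}
```

Under this sketch, a run whose rows all carry scores of exactly 0 or 1 yields the same ratio as binary scoring, which is the sense in which v1 is a special case.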

⚠️ Re-ranking (breaking for the leaderboard)

v2 credits valid alternatives v1 failed, so v2 scores are systematically higher and are not comparable to published v1 numbers. Any cross-model leaderboard must be re-judged under v2. Re-judging only needs stored generation artifacts (the judge stage is independent of generation):

bun runner/judge.ts --model <judge> --input generated/<model-run> --output runs/<model-run>-v2

The published 18-model leaderboard predates v2 and should be re-judged from the maintainers' archived generations before any v2 comparison. See docs/judge-v2-graded-scoring.md.
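For re-judging several archived generations, the command above can be generated per run. The sketch below only builds the argument lists and does not execute anything. The flags and path layout are copied from the command in this section, while `rejudgeArgs`, the judge name, and the run names are placeholders.

```typescript
// Build the argv for re-judging one stored generation under v2.
// Flags and directory layout mirror the command shown in this PR.
function rejudgeArgs(judgeModel: string, modelRun: string): string[] {
  return [
    "bun", "runner/judge.ts",
    "--model", judgeModel,
    "--input", `generated/${modelRun}`,
    "--output", `runs/${modelRun}-v2`,
  ];
}

// Hypothetical run names; substitute the archived generation directories.
const archivedRuns = ["example-model-a", "example-model-b"];

// One shell-ready command line per archived run.
const commands = archivedRuns.map((run) =>
  rejudgeArgs("example-judge", run).join(" "),
);
```

Writing each v2 result to a separate `-v2` output directory keeps the v1 numbers intact for side-by-side comparison.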

Seed result (proof it's discriminating, not inflationary)

deepseek/deepseek-v4-flash judged by claude-sonnet-4.6, 66/67 evals, same generation re-judged under v1 and v2:

Category	v1	v2	Δ	codeQuality
animation	0.614	0.679	+0.065	0.713
async-state	0.650	0.689	+0.039	0.723
lists	0.676	0.709	+0.034	0.762
navigation	0.872	0.895	+0.023	0.828
react-native-apis	0.889	0.917	+0.028	0.856
expo-sdk	1.000	0.986	−0.014	0.820
ALL	0.732	0.769	+0.037	0.772

Gains concentrate where valid-alternative penalties were common (animation, async-state); expo-sdk decreases slightly, because graded scoring can dock an otherwise perfect eval for a minor quality gap. Spot checks confirmed that behavioral requirements the model satisfied gained partial credit, that requirements explicitly mandating a specific mechanism stayed at 0, and that genuine errors (importing FlashList for a LegendList task, missing keyExtractor) still score 0.

Validation

bun test runner — 28/28 pass (incl. new graded partial-credit + clamping/fallback tests)
bun lint clean; tsc --noEmit clean on the judge path
Whitepaper synced per AGENTS.md (judge methodology, scoring formula, summary metrics, limitations, new versioning section)

Out of scope / follow-ups

A separate small PR fixes a Docker cleanup EACCES (root-owned temp dirs failing otherwise-successful evals).
Running OpenRouter models surfaced a multi-slash model-ID parse bug in the ai-sdk-provider-opencode-sdk dependency (openrouter/deepseek/deepseek-v4-flash mis-split) — to be filed upstream.
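To illustrate the multi-slash issue, the sketch below splits a model ID on the first slash only, so everything after the provider prefix stays intact. It assumes the intended reading of `openrouter/deepseek/deepseek-v4-flash` is provider `openrouter` with model `deepseek/deepseek-v4-flash`. The `parseModelId` helper is hypothetical and is not the dependency's actual code.

```typescript
interface ParsedModelId {
  provider: string;
  model: string;
}

// Split on the FIRST slash only; later slashes belong to the model name.
// Splitting on every slash (or on the last one) would break IDs whose
// model part itself contains a slash.
function parseModelId(id: string): ParsedModelId {
  const slash = id.indexOf("/");
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`Expected "<provider>/<model>", got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```

For example, `parseModelId("openrouter/deepseek/deepseek-v4-flash")` keeps `deepseek/deepseek-v4-flash` as a single model name.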

🤖 Generated with Claude Code

…(v2)

Replace binary per-requirement pass/fail with graded partial credit and intent-based judging, so functionally correct, idiomatic React Native code is credited even when it reaches a requirement's goal via a valid alternative API.

- score each requirement 0..1 (passed derived as score >= 0.5); weighted mean drives scoreRatio (v1 binary is the special case score in {0,1})
- judge grades on intent and credits valid idiomatic alternatives; penalties reserved for unmet goals, broken code, or explicit evidence-gated prohibitions
- add per-eval codeQuality score and run-level averageCodeQuality
- tag judge prompt files with their path (path/import-scoped criteria)
- mark outputs with methodologyVersion=2; v2 is not comparable to v1, so leaderboards must be re-judged (docs/judge-v2-graded-scoring.md)
- sync whitepaper methodology/scoring/limitations; add graded-credit tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kwikiel wants to merge 1 commit into callstackincubator:main from kwikiel:feat/judge-v2-graded-scoring
