feat(judge): graded partial-credit scoring with intent-based judging (evals v2) #380

Open
kwikiel wants to merge 1 commit into callstackincubator:main from kwikiel:feat/judge-v2-graded-scoring

@kwikiel kwikiel commented Jun 3, 2026

Summary

The v1 judge scored each requirement as binary pass/fail and graded largely on literal API matching. In practice this rejects functionally correct, idiomatic React Native code when it reaches a requirement's goal via a valid alternative API. Concrete examples found while auditing a real run:

  • A login flow that prevents back-navigation with navigation.reset(...) was failed because the requirement named the declarative if-guard pattern — even though the behavior the prompt asked for is satisfied.
  • A velocity-aware snap animation using withSpring was failed because the requirement named withTiming — even though it is correct, idiomatic, and arguably a better fit for the prompt's "release velocity" wording.

This PR introduces methodology v2: graded partial credit + intent-based judging, so the score measures idiomatic, working code rather than conformance to one specific API among several valid ones.

What changed

  1. Graded partial credit: each requirement is scored on a 0..1 scale (1, 0.75, 0.5, 0.25, or 0), and passed is derived as score >= 0.5. v1 binary scoring is the special case score ∈ {0, 1}.
  2. Intent-based judging — the judge grades each requirement on its underlying goal and credits valid idiomatic alternatives; low scores are reserved for unmet goals, broken code, or explicit evidence-gated prohibitions.
  3. Code-quality dimension — per-eval codeQuality score + run-level averageCodeQuality, measuring production ("Callstack") craft independent of literal matching.
  4. File-path-aware prompts — submitted files are tagged with their path (previously the judge saw unlabeled file contents, which weakened path/import-scoped criteria).
  5. Version marker — per-eval results and summary carry methodologyVersion: 2.
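
Item 4 above (file-path-aware prompts) can be pictured with a small sketch. This is not the code from this PR: the `SubmittedFile` shape, the `renderFilesForJudge` helper, and the `=== FILE: ... ===` delimiter are all hypothetical, chosen only to show how labeling each file with its path lets the judge evaluate path-scoped and import-scoped criteria.

```typescript
// Hypothetical shape for a generated file handed to the judge (assumption).
interface SubmittedFile {
  path: string;
  contents: string;
}

// Render every file under a header that names its path, so the judge can
// tell files apart instead of seeing unlabeled contents.
// The delimiter format is an illustrative assumption.
function renderFilesForJudge(files: SubmittedFile[]): string {
  return files
    .map((f) => `=== FILE: ${f.path} ===\n${f.contents.trimEnd()}`)
    .join("\n\n");
}

const promptSection = renderFilesForJudge([
  { path: "app/login.tsx", contents: "export default function Login() {}\n" },
  { path: "app/_layout.tsx", contents: "export default function Layout() {}\n" },
]);

console.log(promptSection);
```

With a header per file, a criterion such as "the list component must be imported in a specific file" can be checked against the right contents rather than guessed from context.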

Backward compatible: judge rows that omit score fall back to the binary passed flag.
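
As a rough illustration of the scoring rules described above, here is a minimal sketch assuming a per-requirement row with an optional `score` and an optional `passed` flag. Only `score`, `passed`, and `scoreRatio` come from this PR's description; the `JudgedRequirement` shape, the `weight` and `id` fields, and the helper names are assumptions, and the actual implementation in the runner may differ.

```typescript
// Hypothetical row shape; only `score` and `passed` are named in the PR.
interface JudgedRequirement {
  id: string;
  weight: number;
  score?: number; // v2 graded score, expected in [0, 1]
  passed?: boolean; // v1 binary flag, used when `score` is omitted
}

const PASS_THRESHOLD = 0.5;

// Clamp out-of-range judge output into [0, 1]; treat NaN as 0.
function clamp01(x: number): number {
  if (Number.isNaN(x)) return 0;
  return Math.min(1, Math.max(0, x));
}

// Use the graded score when present, otherwise fall back to the binary flag.
function effectiveScore(r: JudgedRequirement): number {
  if (typeof r.score === "number") return clamp01(r.score);
  return r.passed ? 1 : 0;
}

// `passed` is derived from the score rather than reported independently.
function isPassed(r: JudgedRequirement): boolean {
  return effectiveScore(r) >= PASS_THRESHOLD;
}

// Weighted mean of per-requirement scores; 0 when there is no weight at all.
function scoreRatio(reqs: JudgedRequirement[]): number {
  const totalWeight = reqs.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = reqs.reduce((sum, r) => sum + r.weight * effectiveScore(r), 0);
  return weighted / totalWeight;
}

const rows: JudgedRequirement[] = [
  { id: "nav-reset", weight: 2, score: 0.75 }, // graded partial credit
  { id: "legacy-row", weight: 1, passed: true }, // v1-style row, no score
  { id: "out-of-range", weight: 1, score: 1.4 }, // clamped to 1
];

console.log(scoreRatio(rows)); // 0.875
```

When every row carries a score of exactly 0 or 1, this reduces to the v1 weighted pass rate, which is the sense in which v1 is a special case of v2.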

⚠️ Re-ranking (breaking for the leaderboard)

v2 credits valid alternatives v1 failed, so v2 scores are systematically higher and are not comparable to published v1 numbers. Any cross-model leaderboard must be re-judged under v2. Re-judging only needs stored generation artifacts (the judge stage is independent of generation):

bun runner/judge.ts --model <judge> --input generated/<model-run> --output runs/<model-run>-v2

The published 18-model leaderboard predates v2 and should be re-judged from the maintainers' archived generations before any v2 comparison. See docs/judge-v2-graded-scoring.md.

Seed result (proof it's discriminating, not inflationary)

deepseek/deepseek-v4-flash judged by claude-sonnet-4.6, 66/67 evals, same generation re-judged under v1 and v2:

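| Category          | v1    | v2    | Δ      | codeQuality |
|-------------------|-------|-------|--------|-------------|
| animation         | 0.614 | 0.679 | +0.065 | 0.713       |
| async-state       | 0.650 | 0.689 | +0.039 | 0.723       |
| lists             | 0.676 | 0.709 | +0.034 | 0.762       |
| navigation        | 0.872 | 0.895 | +0.023 | 0.828       |
| react-native-apis | 0.889 | 0.917 | +0.028 | 0.856       |
| expo-sdk          | 1.000 | 0.986 | −0.014 | 0.820       |
| ALL               | 0.732 | 0.769 | +0.037 | 0.772       |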

Gains concentrate where valid-alternative penalties were common (animation, async-state). expo-sdk decreases slightly because graded scoring can dock an otherwise perfect eval for a minor quality gap. Spot checks confirmed three things: behavioral requirements the model satisfied gained partial credit; requirements explicitly mandating a specific mechanism stayed at 0; and genuine errors (importing FlashList for a LegendList task, missing keyExtractor) still score 0.

Validation

  • bun test runner — 28/28 pass (incl. new graded partial-credit + clamping/fallback tests)
  • bun lint clean; tsc --noEmit clean on the judge path
  • Whitepaper synced per AGENTS.md (judge methodology, scoring formula, summary metrics, limitations, new versioning section)

Out of scope / follow-ups

  • A separate small PR fixes a Docker cleanup EACCES (root-owned temp dirs failing otherwise-successful evals).
  • Running OpenRouter models surfaced a multi-slash model-ID parse bug in the ai-sdk-provider-opencode-sdk dependency (openrouter/deepseek/deepseek-v4-flash mis-split) — to be filed upstream.

🤖 Generated with Claude Code

…(v2)

Replace binary per-requirement pass/fail with graded partial credit and
intent-based judging, so functionally correct, idiomatic React Native code is
credited even when it reaches a requirement's goal via a valid alternative API.

- score each requirement 0..1 (passed derived as score >= 0.5); weighted mean
  drives scoreRatio (v1 binary is the special case score in {0,1})
- judge grades on intent and credits valid idiomatic alternatives; penalties
  reserved for unmet goals, broken code, or explicit evidence-gated prohibitions
- add per-eval codeQuality score and run-level averageCodeQuality
- tag judge prompt files with their path (path/import-scoped criteria)
- mark outputs with methodologyVersion=2; v2 is not comparable to v1, so
  leaderboards must be re-judged (docs/judge-v2-graded-scoring.md)
- sync whitepaper methodology/scoring/limitations; add graded-credit tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>