feat(judge): graded partial-credit scoring with intent-based judging (evals v2) #380
Open
kwikiel wants to merge 1 commit into
…(v2)
Replace binary per-requirement pass/fail with graded partial credit and
intent-based judging, so functionally correct, idiomatic React Native code is
credited even when it reaches a requirement's goal via a valid alternative API.
- score each requirement 0..1 (passed derived as score >= 0.5); weighted mean
drives scoreRatio (v1 binary is the special case score in {0,1})
- judge grades on intent and credits valid idiomatic alternatives; penalties
reserved for unmet goals, broken code, or explicit evidence-gated prohibitions
- add per-eval codeQuality score and run-level averageCodeQuality
- tag judge prompt files with their path (path/import-scoped criteria)
- mark outputs with methodologyVersion=2; v2 is not comparable to v1, so
leaderboards must be re-judged (docs/judge-v2-graded-scoring.md)
- sync whitepaper methodology/scoring/limitations; add graded-credit tests
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
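The scoring rule described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `JudgeRow` shape, the `weight` field, and the function names are assumptions. Clamping to the range 0..1 is also an assumption, based on the mention of clamping tests. Only `score`, `passed`, and `scoreRatio` are names taken from the PR.

```typescript
// Hypothetical row shape: only `score` and `passed` are named in the PR;
// `weight` is an assumed per-requirement weight.
interface JudgeRow {
  weight: number;
  score?: number;   // graded credit in 0..1 (v2)
  passed?: boolean; // binary flag (v1 rows, or derived in v2)
}

// Resolve a row to a score in [0, 1]. A graded score is clamped into range;
// a row without a usable score falls back to the binary `passed` flag.
function resolveScore(row: JudgeRow): number {
  if (typeof row.score === "number" && Number.isFinite(row.score)) {
    return Math.min(1, Math.max(0, row.score));
  }
  return row.passed ? 1 : 0;
}

// `passed` is derived from the score rather than judged separately.
function isPassed(row: JudgeRow): boolean {
  return resolveScore(row) >= 0.5;
}

// Weighted mean of the per-requirement scores. With scores restricted to
// {0, 1} this reduces to the v1 binary ratio.
function scoreRatio(rows: JudgeRow[]): number {
  const totalWeight = rows.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const earned = rows.reduce((sum, r) => sum + r.weight * resolveScore(r), 0);
  return earned / totalWeight;
}
```

Under these assumptions, three rows with weights 2, 1, 1 and values `score: 0.75`, `passed: true`, `score: 1.4` resolve to 0.75, 1, and 1, giving a `scoreRatio` of 0.875.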
Summary
The v1 judge scored each requirement as binary pass/fail and graded largely on literal API matching. In practice this rejects functionally correct, idiomatic React Native code when it reaches a requirement's goal via a valid alternative API. Concrete examples found while auditing a real run:
- navigation.reset(...) was failed because the requirement named the declarative if-guard pattern, even though the behavior the prompt asked for is satisfied.
- withSpring was failed because the requirement named withTiming, even though it is correct, idiomatic, and arguably a better fit for the prompt's "release velocity" wording.
This PR introduces methodology v2: graded partial credit + intent-based judging, so the score measures idiomatic, working code rather than conformance to one specific API among several valid ones.
What changed
- Each requirement is scored 0..1 (1 / 0.75 / 0.5 / 0.25 / 0); passed is derived (score >= 0.5). v1 binary is the special case score ∈ {0,1}.
- Per-eval codeQuality score + run-level averageCodeQuality, measuring production ("Callstack") craft independent of literal matching.
- Outputs are marked methodologyVersion: 2.
Backward compatible: judge rows that omit score fall back to the binary passed flag.
v2 credits valid alternatives v1 failed, so v2 scores are systematically higher and are not comparable to published v1 numbers. Any cross-model leaderboard must be re-judged under v2. Re-judging only needs stored generation artifacts (the judge stage is independent of generation).
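Because v1 and v2 scores are not comparable, a leaderboard builder could refuse to rank runs judged under different methodology versions. The sketch below is one possible guard, not code from this PR: `RunSummary`, `methodologyOf`, and `buildLeaderboard` are hypothetical names, and treating a missing `methodologyVersion` as v1 is an assumption.

```typescript
// Hypothetical run summary. `scoreRatio` and `methodologyVersion` are names
// from the PR; the rest of the shape is assumed for illustration.
interface RunSummary {
  model: string;
  scoreRatio: number;
  methodologyVersion?: number; // assumed to be absent on pre-v2 outputs
}

// Assumption: outputs without a version marker were judged under v1.
function methodologyOf(run: RunSummary): number {
  return run.methodologyVersion ?? 1;
}

// Rank runs by scoreRatio, but only when every run shares one methodology.
function buildLeaderboard(runs: RunSummary[]): RunSummary[] {
  const versions = Array.from(new Set(runs.map(methodologyOf)));
  if (versions.length > 1) {
    const listed = versions.sort((a, b) => a - b).join(", ");
    throw new Error(
      `Cannot rank runs judged under different methodology versions: ${listed}`
    );
  }
  // Copy before sorting so the caller's array is left untouched.
  return [...runs].sort((a, b) => b.scoreRatio - a.scoreRatio);
}
```

Failing loudly on a mixed set keeps a stale v1 number from silently appearing beside re-judged v2 numbers.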
The published 18-model leaderboard predates v2 and should be re-judged from the maintainers' archived generations before any v2 comparison. See docs/judge-v2-graded-scoring.md.
Seed result (proof it's discriminating, not inflationary)
deepseek/deepseek-v4-flash judged by claude-sonnet-4.6, 66/67 evals, same generation re-judged under v1 and v2.
Gains concentrate where valid-alternative penalties were common (animation, async-state); expo-sdk slightly decreases (graded scoring can dock a perfect eval for a minor quality gap). Spot-checked: behavioral requirements the model satisfied gained partial credit, requirements explicitly mandating a specific mechanism stayed at 0, and genuine errors (importing FlashList for a LegendList task, missing keyExtractor) still score 0.
Validation
- bun test runner: 28/28 pass (incl. new graded partial-credit + clamping/fallback tests)
- bun lint clean; tsc --noEmit clean on the judge path
- AGENTS.md (judge methodology, scoring formula, summary metrics, limitations, new versioning section)
Out of scope / follow-ups
- EACCES (root-owned temp dirs failing otherwise-successful evals).
- ai-sdk-provider-opencode-sdk dependency (openrouter/deepseek/deepseek-v4-flash mis-split), to be filed upstream.
🤖 Generated with Claude Code