feat: add repeat and repeat-fail-on-threshold inputs #865
tgvashworth wants to merge 1 commit into promptfoo:main
Add two new inputs for handling non-deterministic LLM evals:

- `repeat`: runs each test N times via promptfoo's --repeat flag
- `repeat-fail-on-threshold`: per-test threshold requiring each individual test to pass a minimum percentage of its repeated runs

Example: repeat=3 with repeat-fail-on-threshold=66 means each test must pass at least 2 out of 3 runs. This filters out systematic failures while tolerating random grader variance.

Key design decisions:

- Per-test best-of-N, not global aggregate: results are grouped by test description (or vars as fallback) and each test is checked independently against the threshold
- Both fail-on-threshold and repeat-fail-on-threshold run independently when both are set
- When thresholds are configured and pass, the action succeeds even if promptfoo exits non-zero (which it does whenever any test fails)
- Info logging shows repeat config, threshold results, and clear explanations when exec errors are suppressed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
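The per-test best-of-N check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `RunResult` shape, `groupKey`, and `checkPerTestThreshold` are hypothetical names, and the real promptfoo output format may differ.

```typescript
// Hypothetical shape of one repeated run; the real promptfoo output may differ.
interface RunResult {
  description?: string;
  vars: Record<string, unknown>;
  success: boolean;
}

interface TestVerdict {
  key: string;
  passed: number;
  total: number;
  ok: boolean;
}

// Group by test description, falling back to the serialized vars.
// Assumption: vars are serialized with a stable key order across runs.
function groupKey(result: RunResult): string {
  return result.description ?? JSON.stringify(result.vars);
}

// Check each test independently: its own pass rate must meet the threshold.
function checkPerTestThreshold(
  results: RunResult[],
  thresholdPercent: number,
): TestVerdict[] {
  const groups = new Map<string, { passed: number; total: number }>();
  for (const result of results) {
    const key = groupKey(result);
    const group = groups.get(key) ?? { passed: 0, total: 0 };
    group.total += 1;
    if (result.success) group.passed += 1;
    groups.set(key, group);
  }
  return Array.from(groups.entries()).map(([key, group]) => ({
    key,
    passed: group.passed,
    total: group.total,
    ok: (group.passed / group.total) * 100 >= thresholdPercent,
  }));
}
```

With a threshold of 66, a test that passes 2 of 3 runs (about 66.7%) meets the bar, while a test that passes 1 of 3 does not, regardless of how the other tests fared.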
CodeRabbit review: No actionable comments were generated in the recent review.

Walkthrough: This pull request introduces repeat functionality for handling flaky LLM evaluations by adding two new action inputs.

Estimated code review effort: 3 (Moderate), ~25 minutes. Pre-merge checks: 2 passed, 1 failed (1 warning).
Redesign the repeat/threshold feature from PR #865 for better DX:

- Rename repeat-fail-on-threshold to repeat-min-pass (absolute count instead of percentage: "2 of 3", not "66%")
- Change repeat default from '1' to '' (omitted = absent)
- Add strict input parsing that rejects "2.5", "3abc", "02"
- Add cross-field validation (repeat >= 2, min-pass <= repeat)
- Scope exec error suppression to repeat-min-pass only (preserves backward compat for fail-on-threshold users)
- Use ignoreReturnCode instead of try/catch on exec
- Use unique per-run output file path to prevent stale results
- Fail hard on ambiguous/partial grouping instead of warning
- Add repeat summary to PR comments and workflow summaries
- Extract input parsing and threshold logic into utility modules
- Add 130 tests with 100% coverage on new utilities

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
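The strict parsing and cross-field validation listed in this commit message might look something like the sketch below. It is an illustration under stated assumptions, not the utility module from this PR: `parseStrictPositiveInt` and `validateRepeatInputs` are hypothetical names, and the rules that `repeat-min-pass` requires `repeat` and that `repeat >= 2` is enforced only when `repeat-min-pass` is set are assumptions.

```typescript
// Accept only canonical positive integers: no decimals, no trailing text,
// no leading zeros. An empty string means the input was omitted.
function parseStrictPositiveInt(name: string, raw: string): number | undefined {
  const value = raw.trim();
  if (value === "") return undefined;
  if (!/^[1-9][0-9]*$/.test(value)) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return Number(value);
}

interface RepeatConfig {
  repeat: number | undefined;
  minPass: number | undefined;
}

// Cross-field rules from the commit message: repeat >= 2, min-pass <= repeat.
// Assumption: these rules apply only when repeat-min-pass is set.
function validateRepeatInputs(rawRepeat: string, rawMinPass: string): RepeatConfig {
  const repeat = parseStrictPositiveInt("repeat", rawRepeat);
  const minPass = parseStrictPositiveInt("repeat-min-pass", rawMinPass);
  if (minPass !== undefined) {
    if (repeat === undefined) {
      // Assumption: a minimum pass count is meaningless without repeats.
      throw new Error("repeat-min-pass requires repeat to be set");
    }
    if (repeat < 2) {
      throw new Error("repeat must be at least 2 when repeat-min-pass is set");
    }
    if (minPass > repeat) {
      throw new Error("repeat-min-pass must not exceed repeat");
    }
  }
  return { repeat, minPass };
}
```

Matching against a regular expression rather than calling `parseInt` is what rejects inputs such as "3abc", which `parseInt` would silently truncate to 3.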
- `repeat` input that runs each test N times via promptfoo's `--repeat` flag
- `repeat-fail-on-threshold` input that checks each individual test passes a minimum percentage of its repeated runs

Example usage
This example runs each test 3 times and requires each test to pass at least 2 of its 3 runs. The suite-level threshold (`fail-on-threshold`) requires a 90% overall pass rate. Both thresholds are checked independently.
Testing
- `npm test`
- `--repeat 3` is passed to promptfoo and NxR test cases run
- `dist/` rebuilt and included