feat: add repeat and repeat-fail-on-threshold inputs #865

Open
tgvashworth wants to merge 1 commit into promptfoo:main from incident-io:implement-repeat-threshold

@tgvashworth

  • Adds repeat input that runs each test N times via promptfoo's --repeat flag
  • Adds repeat-fail-on-threshold input that checks each individual test passes a minimum percentage of its repeated runs
  • When thresholds are configured and pass, the action succeeds even if promptfoo exits non-zero (which it does whenever any test fails)
  • Adds info logging showing repeat config and threshold results
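
The exit-code behaviour described above can be sketched as a small decision function. This is a minimal illustration, not the PR's code: the `RunOutcome` shape and the `actionShouldFail` name are hypothetical, and it assumes that a configured threshold fully replaces the exit code as the pass/fail signal.

```typescript
// Hypothetical shape for illustration; these field names do not come from the PR.
interface RunOutcome {
  exitCode: number; // promptfoo's exit code (non-zero whenever any test fails)
  thresholdsConfigured: boolean; // true if any threshold input was set
  thresholdsPassed: boolean; // result of the threshold checks, if configured
}

// Assumption: when a threshold is configured, it decides the outcome and the
// exit code is ignored; otherwise the exit code decides.
function actionShouldFail(outcome: RunOutcome): boolean {
  if (outcome.thresholdsConfigured) {
    return !outcome.thresholdsPassed;
  }
  return outcome.exitCode !== 0;
}

// A run where one repeat failed (exit code 1) but every threshold was met.
const suppressed = actionShouldFail({
  exitCode: 1,
  thresholdsConfigured: true,
  thresholdsPassed: true,
});
console.log(suppressed); // false
```

Keeping this decision in one pure function makes the suppression rule easy to unit test separately from the exec call.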

Example usage

- uses: promptfoo/promptfoo-action@v1
  with:
    config: evals/skills.yaml
    repeat: 3
    repeat-fail-on-threshold: 66
    fail-on-threshold: 90

This runs each test 3 times and requires each test to pass at least 2 of its 3 runs. The suite-level threshold requires a 90% overall pass rate. Both are checked independently.
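
The arithmetic behind "66 means 2 of 3" can be sketched as follows. The function names are hypothetical, and the sketch assumes the comparison is "pass rate greater than or equal to threshold"; the PR text does not state the exact operator.

```typescript
// Hypothetical helpers for illustration; the comparison operator (>=) is an assumption.

// True when passed/runs, as a percentage, is at least thresholdPercent.
// Compared in integer form to avoid floating-point rounding.
function meetsThreshold(passed: number, runs: number, thresholdPercent: number): boolean {
  if (runs <= 0) {
    throw new RangeError("runs must be positive");
  }
  return passed * 100 >= thresholdPercent * runs;
}

// Smallest number of passing runs that satisfies the threshold.
function minPassesNeeded(runs: number, thresholdPercent: number): number {
  if (runs <= 0) {
    throw new RangeError("runs must be positive");
  }
  return Math.ceil((thresholdPercent * runs) / 100);
}

console.log(minPassesNeeded(3, 66)); // 2
console.log(minPassesNeeded(3, 67)); // 3
console.log(meetsThreshold(2, 3, 66)); // true  (66.67% >= 66%)
console.log(meetsThreshold(1, 3, 66)); // false (33.33% < 66%)
```

Under this assumption a threshold of 67 would demand 3 of 3 passes, because 2 of 3 is only 66.67%. Rounding the intended fraction down when choosing the percentage avoids that surprise.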

Testing

  • 96 tests pass locally (npm test)
  • Tested in CI on an internal repo with real LLM evals
  • Verified --repeat 3 is passed to promptfoo and NxR test cases run
  • Verified per-test grouping works correctly (groups by description across repeats)
  • Verified threshold pass suppresses exec error
  • dist/ rebuilt and included

Add two new inputs for handling non-deterministic LLM evals:

- `repeat`: runs each test N times via promptfoo's --repeat flag
- `repeat-fail-on-threshold`: per-test threshold requiring each
  individual test to pass a minimum percentage of its repeated runs

Example: repeat=3 with repeat-fail-on-threshold=66 means each test
must pass at least 2 out of 3 runs. This filters out systematic
failures while tolerating random grader variance.

Key design decisions:

- Per-test best-of-N, not global aggregate: results are grouped by
  test description (or vars as fallback) and each test is checked
  independently against the threshold
- Both fail-on-threshold and repeat-fail-on-threshold run independently
  when both are set
- When thresholds are configured and pass, the action succeeds even if
  promptfoo exits non-zero (which it does whenever any test fails)
- Info logging shows repeat config, threshold results, and clear
  explanations when exec errors are suppressed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
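
The per-test best-of-N check described in the commit message can be sketched like this. The `RepeatResult` shape and both function names are hypothetical and do not reflect promptfoo's actual output schema. The fallback key uses `JSON.stringify` on `vars`, which is an assumption and is sensitive to key order.

```typescript
// Hypothetical result shape for illustration; not promptfoo's real output schema.
interface RepeatResult {
  description?: string;
  vars?: Record<string, unknown>;
  success: boolean;
}

// Group by description, falling back to the serialized vars when it is missing.
function groupKey(result: RepeatResult): string {
  return result.description ? result.description : JSON.stringify(result.vars ?? {});
}

// Returns the keys of tests whose pass rate across repeats is below the threshold.
function failingTests(results: RepeatResult[], thresholdPercent: number): string[] {
  const groups = new Map<string, { passed: number; runs: number }>();
  for (const result of results) {
    const key = groupKey(result);
    const group = groups.get(key) ?? { passed: 0, runs: 0 };
    group.runs += 1;
    if (result.success) {
      group.passed += 1;
    }
    groups.set(key, group);
  }

  const failing: string[] = [];
  groups.forEach((group, key) => {
    // Integer comparison: passed/runs >= thresholdPercent/100.
    if (group.passed * 100 < thresholdPercent * group.runs) {
      failing.push(key);
    }
  });
  return failing;
}

const sample: RepeatResult[] = [
  { description: "summarises incident", success: true },
  { description: "summarises incident", success: false },
  { description: "summarises incident", success: true },
  { description: "drafts update", success: false },
  { description: "drafts update", success: false },
  { description: "drafts update", success: true },
];
console.log(failingTests(sample, 66)); // [ 'drafts update' ]
```

Checking each group on its own means one consistently failing test cannot be hidden by a high pass rate elsewhere in the suite, which is the stated reason for preferring per-test checks over a global aggregate.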
@coderabbitai

coderabbitai Bot commented Mar 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 469431f8-2d61-4da3-9331-ff3ce12530f8

📥 Commits

Reviewing files that changed from the base of the PR and between 88c88ba and 7427029.

⛔ Files ignored due to path filters (2)
  • dist/index.js is excluded by !**/dist/**
  • dist/index.js.map is excluded by !**/dist/**, !**/*.map
📒 Files selected for processing (5)
  • README.md
  • __tests__/main.test.ts
  • action.yml
  • src/main.ts
  • src/utils/errors.ts

Walkthrough

This pull request introduces repeat functionality for handling flaky LLM evaluations. Two new action inputs are added: repeat (number of test repetitions, default 1) and repeat-fail-on-threshold (minimum per-test pass rate percentage). The implementation includes validation for both inputs, per-test aggregation logic to compute pass rates across repeated runs, and enhanced error handling to tolerate non-zero exit codes when thresholds pass. Documentation is updated with usage examples, and comprehensive test coverage is added for repeat scenarios, threshold validation, and combined behaviors. No code logic changes affect existing functionality when these inputs are not used.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately describes the main feature additions: adding repeat and repeat-fail-on-threshold inputs as shown across action.yml, src/main.ts, and test files. |
| Description check | ✅ Passed | The description directly addresses the changeset, explaining the repeat and repeat-fail-on-threshold features with usage examples, test coverage, and implementation details. |

mldangelo added a commit that referenced this pull request Mar 14, 2026
Redesign the repeat/threshold feature from PR #865 for better DX:

- Rename repeat-fail-on-threshold to repeat-min-pass (absolute count
  instead of percentage — "2 of 3" not "66%")
- Change repeat default from '1' to '' (omitted = absent)
- Add strict input parsing that rejects "2.5", "3abc", "02"
- Add cross-field validation (repeat >= 2, min-pass <= repeat)
- Scope exec error suppression to repeat-min-pass only (preserves
  backward compat for fail-on-threshold users)
- Use ignoreReturnCode instead of try/catch on exec
- Use unique per-run output file path to prevent stale results
- Fail hard on ambiguous/partial grouping instead of warning
- Add repeat summary to PR comments and workflow summaries
- Extract input parsing and threshold logic into utility modules
- Add 130 tests with 100% coverage on new utilities

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
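
The strict parsing and cross-field validation listed in this commit message could look roughly like the sketch below. The function names, error messages, and return shape are hypothetical. It also assumes that `repeat-min-pass` requires `repeat` to be set, and that the `repeat >= 2` rule applies only when `repeat-min-pass` is present; the commit message does not spell out either point.

```typescript
// Hypothetical helpers for illustration; names and messages are not from the commit.

// Accepts only canonical positive integers: rejects "2.5", "3abc", "02", "0".
// An empty string means the input was omitted.
function parseStrictPositiveInt(raw: string, name: string): number | undefined {
  const trimmed = raw.trim();
  if (trimmed === "") {
    return undefined;
  }
  if (!/^[1-9]\d*$/.test(trimmed)) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return Number(trimmed);
}

interface RepeatConfig {
  repeat?: number;
  minPass?: number;
}

// Assumption: the cross-field rules are enforced only when repeat-min-pass is set.
function validateRepeatInputs(repeatRaw: string, minPassRaw: string): RepeatConfig {
  const repeat = parseStrictPositiveInt(repeatRaw, "repeat");
  const minPass = parseStrictPositiveInt(minPassRaw, "repeat-min-pass");
  if (minPass !== undefined) {
    if (repeat === undefined || repeat < 2) {
      throw new Error("repeat-min-pass requires repeat >= 2");
    }
    if (minPass > repeat) {
      throw new Error("repeat-min-pass must not exceed repeat");
    }
  }
  return { repeat, minPass };
}

console.log(validateRepeatInputs("3", "2")); // { repeat: 3, minPass: 2 }
```

An anchored regular expression is used instead of `parseInt` because `parseInt("3abc", 10)` returns 3 and `parseInt("2.5", 10)` returns 2, so malformed inputs would otherwise be accepted silently.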