test: add benchmark comparison metadata fallback coverage by mldangelo-oai · Pull Request #825 · promptfoo/modelaudit

mldangelo-oai · 2026-03-31T22:02:57Z

Summary

- preserve baseline metadata fields (size/files/target) when current benchmark entries contain only partial extra_info
- add focused regression test covering the partial-metadata comparison path in benchmark reporting
- keep change scoped to benchmark report comparison logic only

Validation

/Users/mdangelo/.virtualenvs/openai/bin/ruff format scripts/benchmark_report.py tests/test_benchmark_report.py
/Users/mdangelo/.virtualenvs/openai/bin/ruff check scripts/benchmark_report.py tests/test_benchmark_report.py
/Users/mdangelo/.virtualenvs/openai/bin/mypy tests/test_benchmark_report.py
pytest run is blocked in sandbox by ddtrace PermissionError

Summary by CodeRabbit

Bug Fixes
- Improved benchmark comparison reports to handle incomplete current benchmark data. When current metrics are missing, the tool now uses baseline values as fallback to ensure complete and accurate reporting.

coderabbitai · 2026-03-31T22:03:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2f2b505c-6fba-46fe-b682-2b636bdcc714

📥 Commits

Reviewing files that changed from the base of the PR and between d0d4a2d and fce85f5.

📒 Files selected for processing (2)

scripts/benchmark_report.py
tests/test_benchmark_report.py

Walkthrough

Added a helper function _merged_record_context() to intelligently merge benchmark metadata (target, size, files) between current and baseline records, using current values and falling back to baseline when current values are absent (marked as "-"). Updated the summary builder to use this new function, plus added test coverage for the fallback behavior.
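The fallback rule described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code from `scripts/benchmark_report.py`: the `RecordContext` dataclass, the `PLACEHOLDER` constant, and the function signature are assumptions made for the example, and only the "use current unless it is `-`" rule comes from the description above.

```python
from dataclasses import dataclass

# Assumed marker for a metadata field that is missing from a record.
PLACEHOLDER = "-"


@dataclass(frozen=True)
class RecordContext:
    """Hypothetical stand-in for the metadata attached to one benchmark record."""

    target: str = PLACEHOLDER
    size: str = PLACEHOLDER
    files: str = PLACEHOLDER


def merged_record_context(current: RecordContext, baseline: RecordContext) -> RecordContext:
    """Prefer each current field; fall back to baseline when current is the placeholder."""

    def pick(cur: str, base: str) -> str:
        return base if cur == PLACEHOLDER else cur

    return RecordContext(
        target=pick(current.target, baseline.target),
        size=pick(current.size, baseline.size),
        files=pick(current.files, baseline.files),
    )


# Current record carries only the target; size and files come from baseline.
current = RecordContext(target="safe_model.pkl")
baseline = RecordContext(target="safe_model.pkl", size="49.4 KiB", files="1")
merged = merged_record_context(current, baseline)
print(merged)
```

If both records lack a field, the placeholder is kept, so the report would still show `-` for that column.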

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Benchmark Metadata Merging `scripts/benchmark_report.py` | Added `_merged_record_context()` function that selects benchmark context fields (target, size, files) from current record unless they are "-", in which case it falls back to baseline values. Updated `_build_summary()` to use this function when building `ComparisonRow` entries for shared benchmarks. |
| Test Coverage `tests/test_benchmark_report.py` | Added `test_benchmark_report_uses_baseline_size_when_current_metadata_partial()` to verify that when current benchmark metadata is partial (missing size/file count), the generated output correctly uses baseline values while retaining the current record's path information. |
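A regression test of the kind described above might look something like this. It is a sketch under stated assumptions rather than the test added in this PR: `context_from_extra_info` and `render_row` are hypothetical helpers, the `extra_info` key names are assumed, and the sample paths are illustrative data only.

```python
def context_from_extra_info(extra_info: dict) -> dict:
    """Hypothetical: read target/size/files from extra_info, using "-" for absent keys."""
    return {key: str(extra_info.get(key, "-")) for key in ("target", "size", "files")}


def render_row(name: str, current_info: dict, baseline_info: dict) -> str:
    """Hypothetical: build one tab-separated report row with baseline fallback."""
    current = context_from_extra_info(current_info)
    baseline = context_from_extra_info(baseline_info)
    # Keep the current value unless it is the "-" placeholder.
    merged = {key: baseline[key] if current[key] == "-" else current[key] for key in current}
    return "\t".join([name, merged["target"], merged["size"], merged["files"]])


def test_row_uses_baseline_size_when_current_metadata_partial():
    baseline_info = {"target": "old/safe_model.pkl", "size": "49.4 KiB", "files": 1}
    current_info = {"target": "new/safe_model.pkl"}  # size and files are missing

    row = render_row("test_scan_safe_pickle", current_info, baseline_info)

    # Path information comes from the current record, size and files from baseline.
    assert row == "test_scan_safe_pickle\tnew/safe_model.pkl\t49.4 KiB\t1"
```

The point of asserting on the rendered row, rather than on the merge helper alone, is that it exercises the comparison path end to end, which is where partial metadata would otherwise surface as `-` in the report.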

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A fuzzy-eared fix for gaps in the data,
When benchmarks run thin, we've now got a matter—
Baseline steps in where current falls short,
Merging with grace, of every sort!
No "-" can stop us, we fill every place, 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a test to cover metadata fallback scenarios in benchmark comparison logic. |


github-actions · 2026-03-31T22:03:31Z

Workflow run and artifacts

Performance Benchmarks

Compared 6 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 1 improved, 5 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 900.71ms -> 752.49ms (-16.5%).

Top improvements:

tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle -95.9% (155.08ms -> 6.44ms, safe_model.pkl, size=49.4 KiB, files=1)

| Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle` | `safe_model.pkl` | 49.4 KiB | 1 | 155.08ms | 6.44ms | -95.9% | improved |
| `tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle` | `safe_model.pkl` | 49.4 KiB | 1 | 125.6us | 127.7us | +1.6% | stable |
| `tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory` | `duplicate-corpus` | 840.0 KiB | 81 | 123.54ms | 124.68ms | +0.9% | stable |
| `tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip` | `state_dict.pt` | 1.5 MiB | 1 | 290.11ms | 288.72ms | -0.5% | stable |
| `tests/benchmarks/test_scan_benchmarks.py::test_scan_mixed_directory` | `mixed-corpus` | 1.7 MiB | 54 | 331.81ms | 332.49ms | +0.2% | stable |
| `tests/benchmarks/test_scan_benchmarks.py::test_validate_file_type_pytorch_zip` | `state_dict.pt` | 1.5 MiB | 1 | 42.3us | 42.2us | -0.2% | stable |
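The status column and the aggregate figure in this report can be approximated with a short sketch. The 15% threshold and the timings are taken from the report; the function names, the symmetric treatment of improvements, and the handling of values exactly at the threshold are assumptions, since the report script itself is not shown here. Timings are converted to milliseconds, and because they are already rounded, recomputed percentages can differ from the reported ones in the last digit.

```python
from collections import Counter


def percent_change(baseline: float, current: float) -> float:
    """Relative change of current versus baseline, in percent."""
    return (current - baseline) / baseline * 100.0


def classify(baseline: float, current: float, threshold_pct: float = 15.0) -> str:
    """Assumed rule: slower by more than the threshold is a regression,
    faster by more than the threshold is an improvement, otherwise stable."""
    change = percent_change(baseline, current)
    if change > threshold_pct:
        return "regression"
    if change < -threshold_pct:
        return "improved"
    return "stable"


# (baseline_ms, current_ms) pairs copied from the table, with us converted to ms.
medians_ms = {
    "test_scan_safe_pickle": (155.08, 6.44),
    "test_detect_file_format_safe_pickle": (0.1256, 0.1277),
    "test_scan_duplicate_directory": (123.54, 124.68),
    "test_scan_pytorch_zip": (290.11, 288.72),
    "test_scan_mixed_directory": (331.81, 332.49),
    "test_validate_file_type_pytorch_zip": (0.0423, 0.0422),
}

statuses = {name: classify(base, cur) for name, (base, cur) in medians_ms.items()}
counts = Counter(statuses.values())

# Summing the per-benchmark values appears to reproduce the aggregate line.
total_baseline = sum(base for base, _ in medians_ms.values())
total_current = sum(cur for _, cur in medians_ms.values())
aggregate_change = percent_change(total_baseline, total_current)

print(counts)
print(f"{total_baseline:.2f}ms -> {total_current:.2f}ms ({aggregate_change:+.1f}%)")
```

Under these assumptions only `test_scan_safe_pickle` crosses the threshold, which matches the "1 improved, 5 stable" line in the report.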

test: cover benchmark metadata fallback in comparisons

fce85f5

mldangelo-oai added the codex label Mar 31, 2026

mldangelo merged commit ca33c83 into main Mar 31, 2026
24 checks passed

mldangelo deleted the automation/test-gap-detection-20260331 branch March 31, 2026 23:08
