WIP feat: swe bench scorer#342

Draft
tianmu-li wants to merge 21 commits into
mlcommons:mainfrom
tianmu-li:feat/swe_bench_scorer

Conversation

@tianmu-li

Collaborator

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@github-actions

github-actions Bot commented Jun 5, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for SWE-bench accuracy evaluation by adding a new accuracy-only SWEBench dataset, a SWEBenchScorer that runs evaluations using mini-swe-agent in an isolated environment, and associated configuration templates, tests, and runbooks. Feedback on the changes focuses on improving the robustness of the SWEBenchScorer implementation, specifically by safely handling missing or null values when parsing configuration templates, benchmark configurations, and evaluation results, as well as gracefully handling cases where the Docker binary is missing from the system's PATH during preflight checks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1328 to +1338
docker_result = subprocess.run(
    ["docker", "version"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.PIPE,
    timeout=10,
)
if docker_result.returncode != 0:
    raise SetupError(
        "Docker daemon is not running or docker is not on PATH. "
        "Start Docker and retry."
    )

high

If docker is not installed or not present on the system's PATH, calling subprocess.run(["docker", ...]) will raise a FileNotFoundError rather than returning a non-zero exit code. This will cause the preflight check to crash with an unhandled traceback. Checking if docker is on PATH using shutil.which and wrapping the execution in a try-except block ensures a clean SetupError is raised.

Suggested change
- docker_result = subprocess.run(
-     ["docker", "version"],
-     stdout=subprocess.DEVNULL,
-     stderr=subprocess.PIPE,
-     timeout=10,
- )
- if docker_result.returncode != 0:
-     raise SetupError(
-         "Docker daemon is not running or docker is not on PATH. "
-         "Start Docker and retry."
-     )
+ if shutil.which("docker") is None:
+     raise SetupError(
+         "docker is not on PATH. Install Docker and retry."
+     )
+ try:
+     docker_result = subprocess.run(
+         ["docker", "version"],
+         stdout=subprocess.DEVNULL,
+         stderr=subprocess.PIPE,
+         timeout=10,
+     )
+ except Exception as e:
+     raise SetupError(f"Failed to execute docker command: {e}")
+ if docker_result.returncode != 0:
+     raise SetupError(
+         "Docker daemon is not running. Start Docker and retry."
+     )

Comment on lines +1249 to +1255
with self.swebench_config_template.open() as _f:
    _tmpl = yaml.safe_load(_f)
if not isinstance((_tmpl or {}).get("model", {}).get("model_kwargs"), dict):
    raise ValueError(
        f"swebench template {self.swebench_config_template} must have a "
        "'model.model_kwargs' dict; check the template structure."
    )

medium

The validation of the swebench_config_template structure is fragile: it raises an AttributeError if "model" is present but is not a dictionary (e.g., a string or null), because the {} default in .get("model", {}) applies only when the key is missing. An empty file, which parses as None, is already covered by the (_tmpl or {}) guard. Checking the type of each level explicitly prevents this runtime crash.

Suggested change
- with self.swebench_config_template.open() as _f:
-     _tmpl = yaml.safe_load(_f)
- if not isinstance((_tmpl or {}).get("model", {}).get("model_kwargs"), dict):
-     raise ValueError(
-         f"swebench template {self.swebench_config_template} must have a "
-         "'model.model_kwargs' dict; check the template structure."
-     )
+ with self.swebench_config_template.open() as _f:
+     _tmpl = yaml.safe_load(_f) or {}
+ model_cfg = _tmpl.get("model")
+ if not isinstance(model_cfg, dict) or not isinstance(model_cfg.get("model_kwargs"), dict):
+     raise ValueError(
+         f"swebench template {self.swebench_config_template} must have a "
+         "'model.model_kwargs' dict; check the template structure."
+     )

Comment on lines +1350 to +1353
model_params = benchmark_config_dict.get("model_params", {})
endpoints = benchmark_config_dict.get("endpoint_config", {}).get(
    "endpoints", []
)

medium

If endpoint_config or model_params is defined as empty or null in the YAML configuration, benchmark_config_dict.get(...) can return None. Calling .get() on None will raise an AttributeError. Using the or {} pattern ensures safe dictionary access.

        model_params = benchmark_config_dict.get("model_params") or {}
        endpoint_cfg = benchmark_config_dict.get("endpoint_config") or {}
        endpoints = endpoint_cfg.get("endpoints", [])
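
To make the failure mode concrete, here is a minimal sketch of the difference between the two access patterns. The dictionary contents are a made-up example of what a YAML file with an empty `endpoint_config:` key would parse to; only the variable names come from the snippet above.

```python
# Hypothetical parsed config: `endpoint_config:` is present but has no value,
# and `model_params` is absent entirely.
benchmark_config_dict = {"endpoint_config": None}

# The default passed to .get() only applies when the key is missing,
# so a key that is present with a null value still yields None.
raw = benchmark_config_dict.get("endpoint_config", {})

# The `or {}` form covers both a missing key and a null value.
model_params = benchmark_config_dict.get("model_params") or {}
endpoint_cfg = benchmark_config_dict.get("endpoint_config") or {}
endpoints = endpoint_cfg.get("endpoints", [])
```

With the original form, `raw.get("endpoints", [])` would raise AttributeError here because `raw` is None.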

Comment on lines +1511 to +1512
submitted = result.get("submitted_instances", 0)
resolved = result.get("resolved_instances", 0)

medium

If the SWE-bench evaluation results JSON contains null values for submitted_instances or resolved_instances, .get(..., 0) will return None instead of 0. This will subsequently cause a TypeError when calculating the resolved rate. Using or 0 ensures that any None values are safely defaulted to 0.

Suggested change
- submitted = result.get("submitted_instances", 0)
- resolved = result.get("resolved_instances", 0)
+ submitted = result.get("submitted_instances") or 0
+ resolved = result.get("resolved_instances") or 0
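
For context, a rough sketch of how the two counters could feed the resolved rate. The function name `resolved_rate` is hypothetical and not taken from the PR; the None return for zero submissions mirrors the "submitted_instances=0 guard returns None score" test mentioned in the commit log.

```python
from typing import Optional


def resolved_rate(result: dict) -> Optional[float]:
    """Return resolved / submitted, or None when nothing was submitted."""
    # `or 0` maps both a missing key and an explicit null to 0.
    submitted = result.get("submitted_instances") or 0
    resolved = result.get("resolved_instances") or 0
    if submitted == 0:
        # Avoid ZeroDivisionError; no score can be reported.
        return None
    return resolved / submitted
```

Usage: `resolved_rate({"submitted_instances": 4, "resolved_instances": 1})` gives 0.25, while a result with null counters gives None instead of raising.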

tianmu-li and others added 19 commits June 8, 2026 22:16
Adds SWEBenchScorer (scorer_id="swe_bench_scorer") and a predefined
SWEBench dataset that downloads from princeton-nlp/SWE-bench_Verified
or SWE-bench_Lite on HuggingFace.

The scorer patches a committed swebench_template.yaml with model name,
endpoint URL, and all sampling parameters from model_params at runtime
(no duplication into extras), then runs mini-swe-agent and grades
predictions with the swebench evaluation harness, reporting
resolved_instances / submitted_instances as the accuracy score.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
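
The template patching described in this commit might look roughly like the sketch below. It works on an already-parsed template dict; the only structure confirmed by the PR is the `model.model_kwargs` dict, so the `model_name` and `api_base` keys and the function name `patch_template` are assumptions made for illustration.

```python
import copy


def patch_template(template: dict, model_name: str, endpoint_url: str,
                   model_params: dict) -> dict:
    """Return a patched copy of the parsed template; the input is not modified."""
    patched = copy.deepcopy(template)
    model_cfg = patched["model"]
    model_cfg["model_name"] = model_name      # assumed key name
    kwargs = model_cfg["model_kwargs"]        # structure confirmed by the PR
    kwargs["api_base"] = endpoint_url         # assumed key name
    # Sampling parameters from model_params override template defaults.
    kwargs.update(model_params)
    return patched
```

Copying before patching keeps the committed template reusable across runs, which fits the "no duplication into extras" intent.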
swebench.harness.run_evaluation writes its result JSON to its CWD.
The eval subprocess was running with cwd=self.mini_swe_agent_dir,
scattering files in the user's external venv directory. Change CWD to
output_dir (report_dir/swe_bench_output/) so all outputs are
self-contained under the report directory. Update the result file
lookup accordingly and fix the test fixture to match.

Add examples/10_SWEBench_Example/accuracy/RUNBOOK.md with venv setup
and smoke-test instructions, mirroring the VBench runbook pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the external mini-swe-agent venv model with an in-repo uv
subproject at examples/10_SWEBench_Example/accuracy/ (pyproject.toml +
uv.lock), mirroring how VBenchScorer isolates vbench dependencies.

- Add accuracy/pyproject.toml pinning mini-swe-agent==2.3.0 and
  swebench==4.1.0 with package=false; generate uv.lock
- Rename _DEFAULT_MINI_SWE_AGENT_DIR/_MINI_SWE_AGENT_DIR_ENV to
  _DEFAULT_SWE_BENCH_PROJECT_PATH/_SWE_BENCH_PROJECT_PATH_ENV; default
  now points into the repo (examples/10_SWEBench_Example/accuracy)
- _run_subprocess now wraps commands with `uv run --project <path>`
  instead of manually activating a .venv; drop manual env patching
- Init check now validates pyproject.toml presence, not .venv/bin/python
- Update RUNBOOK to reflect `uv sync` setup; remove test-dev paths
- Update swe_bench_accuracy.yaml: remove hardcoded /home/user path,
  rename key to swe_bench_project_path
- Update tests: rename fixture mini_swe_dir → swe_bench_project,
  parameter mini_swe_agent_dir → swe_bench_project_path
- Exclude accuracy subproject uv.lock files from the large-file hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
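
A minimal sketch of the `uv run --project <path>` wrapping and the pyproject.toml presence check from this commit. The helper names `wrap_with_uv` and `ensure_uv_project` are hypothetical, and the real `_run_subprocess` may build its command differently.

```python
from pathlib import Path


def ensure_uv_project(project_path: Path) -> None:
    """Fail early if the directory is not a uv project."""
    if not (project_path / "pyproject.toml").is_file():
        raise FileNotFoundError(
            f"{project_path} has no pyproject.toml; check the project path."
        )


def wrap_with_uv(cmd: list[str], project_path: Path) -> list[str]:
    """Prefix a command so uv runs it inside the subproject's environment."""
    return ["uv", "run", "--project", str(project_path), *cmd]
```

For example, `wrap_with_uv(["mini-extra", "swebench"], Path("accuracy"))` yields `["uv", "run", "--project", "accuracy", "mini-extra", "swebench"]`; the wrapped command shown is only an example.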
Minimal end-to-end smoke config (5 perf samples, 1 accuracy instance)
for validating the SWEBenchScorer pipeline without running a full
600-second benchmark. Uses a 1-row JSONL for the accuracy phase to
avoid issuing 500 predefined-dataset requests to the endpoint before
the scorer runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a multi-turn + SWE-bench accuracy smoke config
(swe_bench_multiturn_smoke.yaml) and its minimal 2-conversation perf
dataset. Confirmed end-to-end: multi-turn perf phase completes, then
SWEBenchScorer runs mini-swe-agent + harness and returns a score
without errors (exit 0).

Also adds a comment to swe_bench_accuracy.yaml noting that the perf
dataset can be replaced with a multi-turn dataset without other changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…phase

Adds Dataset.ACCURACY_ONLY class variable (default False). SWEBench sets
it to True — problem statements sent directly to the model without an agent
framework don't reflect real SWE-bench usage, so using the predefined
dataset as a performance dataset is now rejected with InputValidationError.

The check in _load_datasets() fires before create_loader() via a PREDEFINED
lookup, so it gives a clear error rather than a confusing downstream failure.

Updates swe_bench_accuracy.yaml, swe_bench_accuracy_smoke.yaml, and
swe_bench_multiturn_smoke.yaml to use an explicit JSONL for the perf phase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
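
The ACCURACY_ONLY guard can be sketched as follows. The class-variable name, the SWEBench dataset, and InputValidationError come from the commit message; the `PREDEFINED` registry contents, the `"swe_bench"` key, and the `check_perf_dataset` helper are stand-ins, since the real check lives inside `_load_datasets()`.

```python
class InputValidationError(ValueError):
    """Stand-in for the project's InputValidationError."""


class Dataset:
    ACCURACY_ONLY = False  # default: usable for perf and accuracy


class SWEBench(Dataset):
    ACCURACY_ONLY = True   # raw problem statements are not a realistic perf load


# Hypothetical registry of predefined datasets, keyed by name.
PREDEFINED = {"swe_bench": SWEBench}


def check_perf_dataset(name: str) -> None:
    """Reject accuracy-only predefined datasets before any loader is created."""
    dataset_cls = PREDEFINED.get(name)
    if dataset_cls is not None and dataset_cls.ACCURACY_ONLY:
        raise InputValidationError(
            f"{name} is accuracy-only and cannot be used as a performance dataset."
        )
```

Names that are not in the registry, such as an explicit JSONL path, pass through the check untouched.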
- Replace two-branch subset ternary with _SWE_BENCH_HF_MAP dict lookup;
  validate subset in __init__ so unknown values (e.g. "full") raise
  ValueError at construction time rather than silently scoring against
  the wrong dataset.
- Hoist `import yaml` to module top level; declare pyyaml==6.0.3 in
  pyproject.toml (was already a de-facto transitive dep, now explicit).
- Add msgspec.json.decode(..., type=dict) on harness result file so a
  non-dict JSON response raises DecodeError instead of AttributeError.
- Validate swebench template schema at construction: raise ValueError if
  model.model_kwargs is missing, surfacing bad templates early.
- Add unit tests for all four fixes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
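
A sketch of the dict-lookup validation from the first bullet. The HuggingFace dataset names are the ones given in the first commit and `"verified"` appears as the default subset elsewhere in the log; the `"lite"` key and the `resolve_hf_dataset` helper name are assumptions.

```python
_SWE_BENCH_HF_MAP = {
    "verified": "princeton-nlp/SWE-bench_Verified",
    "lite": "princeton-nlp/SWE-bench_Lite",  # key name assumed
}


def resolve_hf_dataset(subset: str) -> str:
    """Map a subset name to its HuggingFace dataset, rejecting unknown names."""
    try:
        return _SWE_BENCH_HF_MAP[subset]
    except KeyError:
        raise ValueError(
            f"unknown subset {subset!r}; expected one of {sorted(_SWE_BENCH_HF_MAP)}"
        ) from None
```

Unlike a two-branch ternary, an unknown value such as `"full"` cannot silently fall through to one of the two datasets.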
Scorer.preflight() class method hook — no-op by default — is called in
_load_datasets for every accuracy scorer before the perf phase begins.
SWEBenchScorer overrides it to verify:
  1. uv is on PATH
  2. mini-extra is runnable in the accuracy subproject (uv sync was run)
  3. swebench is importable in the subproject
  4. Docker daemon is running

All four raise SetupError with actionable messages so a misconfigured
environment is caught upfront rather than after a potentially long
performance run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
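
The preflight hook pattern could look like this sketch. It covers only the PATH lookups (not the mini-extra, swebench import, or Docker daemon checks), and the `which` parameter is an addition made purely so the sketch can be exercised without the tools installed; the real method signature is not shown in this PR page.

```python
import shutil
from typing import Callable, Optional


class SetupError(RuntimeError):
    """Stand-in for the project's SetupError."""


class Scorer:
    @classmethod
    def preflight(cls) -> None:
        """No-op by default; subclasses override to verify their environment."""


class SWEBenchScorer(Scorer):
    @classmethod
    def preflight(
        cls, which: Callable[[str], Optional[str]] = shutil.which
    ) -> None:
        # Fail before the perf phase starts, with an actionable message.
        for tool in ("uv", "docker"):
            if which(tool) is None:
                raise SetupError(f"{tool} is not on PATH. Install {tool} and retry.")
```

Because the base hook does nothing, existing scorers need no changes to participate in the preflight pass.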
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
num_instances=100, workers=10, max_eval_workers=10, subset="verified"
reflect the intended accuracy evaluation target out of the box.
Removes the None fallback for num_instances and cleans up redundant
extras from YAML examples.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add four tests to match VBenchScorer coverage depth:
- subprocess non-zero exit code raises RuntimeError
- subprocess timeout raises RuntimeError
- result file found via glob fallback when exact name absent
- submitted_instances=0 guard returns None score

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
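
The glob fallback exercised by the third test might be sketched as below. The helper name `find_result_file` and the `*.json` pattern are assumptions; the PR page does not show the actual file-name pattern the scorer uses.

```python
from pathlib import Path
from typing import Optional


def find_result_file(output_dir: Path, expected_name: str) -> Optional[Path]:
    """Prefer the exact result file; otherwise fall back to a glob match."""
    exact = output_dir / expected_name
    if exact.is_file():
        return exact
    # Assumed fallback pattern: any JSON file in the output directory.
    matches = sorted(output_dir.glob("*.json"))
    return matches[0] if matches else None
```

Sorting the matches keeps the fallback deterministic when more than one candidate file is present.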
VBench does not have smoke variants; these files used dummy perf data
and a 1-instance accuracy run that users can achieve via --extras
num_instances=1 on the CLI. No other files reference them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Scorer.preflight: collapse to a one-liner (matches available_scorers style)
- SWEBenchScorer class docstring: drop numbered step walkthrough and extras
  parameter table; keep the what and the uv-run isolation rationale

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove test_score_result_non_dict_raises_decode_error — asserts on
  msgspec behavior, not SWEBenchScorer logic
- Merge subset pair into test_subset_maps_to_correct_hf_dataset_name
  with parametrize
- Merge slice pair into test_num_instances_produces_correct_slice
  with parametrize

20 test functions, 22 pytest cases (was 23 functions, 23 cases)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dir to absolute

SWEBenchScorer delegates entirely to mini-extra and never reads endpoint
responses, so the accuracy endpoint phase was sending 500 samples through
the endpoint for nothing. Add SKIP_ENDPOINT_PHASE class variable to Scorer
(default False, True on SWEBenchScorer) and guard _build_phases() and the
total_samples count with it.

Also add external_sample_count() classmethod so scorers that skip the
endpoint phase can still surface their sample count in the setup log.
SWEBenchScorer returns num_instances from extras.

Fix a second bug where a relative report_dir (e.g. logs/foo launched from
repo root) caused mini-extra to fail with FileNotFoundError on the patched
config YAML because all derived paths were relative and mini-extra runs
with cwd=output_dir. Resolving self.report_dir to absolute in
SWEBenchScorer.__init__ fixes all derived paths at once.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
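
A sketch of how SKIP_ENDPOINT_PHASE and external_sample_count() could interact. The two names come from the commit message; the `extras` argument, the return types, and the `endpoint_sample_total` helper are assumptions standing in for the logic in `_build_phases()` and the total_samples count.

```python
from typing import Optional


class Scorer:
    SKIP_ENDPOINT_PHASE = False

    @classmethod
    def external_sample_count(cls, extras: dict) -> Optional[int]:
        """Samples scored outside the endpoint phase; None if not applicable."""
        return None


class SWEBenchScorer(Scorer):
    SKIP_ENDPOINT_PHASE = True  # scoring is delegated to mini-extra

    @classmethod
    def external_sample_count(cls, extras: dict) -> Optional[int]:
        return extras.get("num_instances")


def endpoint_sample_total(datasets: list) -> int:
    """Sum sample counts, skipping scorers that never read endpoint responses."""
    return sum(
        count for scorer_cls, count in datasets
        if not scorer_cls.SKIP_ENDPOINT_PHASE
    )
```

With `[(Scorer, 10), (SWEBenchScorer, 500)]` the endpoint total is 10, and the 500 SWE-bench samples are never sent through the endpoint.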
SWEBenchScorer never uses sample_index_map (it scores via mini-extra
subprocess), but Scorer.__init__ unconditionally called _load_sample_index_map,
which KeyErrors when the accuracy phase was skipped and 'swe_bench' was
therefore absent from sample_idx_map.json.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Restore corrupted Apache 2.0 license header in scoring.py
- Move import yaml into sorted third-party block (ruff I001)
- Add pass body and clarify noqa comment on base Scorer.preflight
- Add type annotation and assert guards for sample_index_map (fixes mypy)
- Change zip(strict=False) to strict=True for accuracy_datasets/eval_configs
- Move _SelfContainedScorer from inside test method to module level; remove lazy imports
- Filter _-prefixed test-only scorers from TestScorerMethodSync registry check
- Add test_external_sample_count covering all branches of SWEBenchScorer.external_sample_count

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
SWE-bench is the accuracy evaluation for the multi-turn agentic workload,
so its config, agent template, accuracy subproject, and runbook belong in
09_MultiTurn/ rather than a separate 10_SWEBench_Example/ folder.

- Rename examples/10_SWEBench_Example/ → examples/09_MultiTurn/ (4 files)
- Update _DEFAULT_SWE_BENCH_PROJECT_PATH and _DEFAULT_SWE_BENCH_TEMPLATE
  constants in scoring.py to point at the new location
- Update all self-referential paths in swe_bench_accuracy.yaml,
  RUNBOOK.md, and pyproject.toml
- Add SWE-bench Accuracy section to examples/09_MultiTurn/README.md

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…accuracy.yaml

dummy_1k.jsonl uses the column name text_input; the OpenAI adapter's
ColumnFilter requires prompt. Without the parser remap the benchmark
fails at dataset load with a KeyError on the prompt column.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
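
The effect of the parser remap can be illustrated with a small sketch. The column names `text_input` and `prompt` come from the commit message; the `remap_columns` helper is hypothetical and does not reflect the project's actual parser configuration syntax.

```python
def remap_columns(row: dict, mapping: dict) -> dict:
    """Rename columns according to mapping; unmapped columns are kept as-is."""
    return {mapping.get(key, key): value for key, value in row.items()}


# One row shaped like the dummy dataset, renamed for the adapter.
row = remap_columns({"text_input": "hello"}, {"text_input": "prompt"})
```

After the remap the row exposes the `prompt` column, so a filter that requires it no longer fails with a KeyError.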
@tianmu-li tianmu-li force-pushed the feat/swe_bench_scorer branch from 89af1d1 to 6016c01 Compare June 8, 2026 22:52
tianmu-li and others added 2 commits June 8, 2026 22:53
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
The branch introduced blank lines that violated ruff I001 in both files.
Aligning with the main-branch import layout so pre-commit passes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>