WIP feat: swe bench scorer #342
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces support for SWE-bench accuracy evaluation by adding a new accuracy-only SWEBench dataset, a SWEBenchScorer that runs evaluations using mini-swe-agent in an isolated environment, and associated configuration templates, tests, and runbooks. Feedback on the changes focuses on improving the robustness of the SWEBenchScorer implementation, specifically by safely handling missing or null values when parsing configuration templates, benchmark configurations, and evaluation results, as well as gracefully handling cases where the Docker binary is missing from the system's PATH during preflight checks.
    docker_result = subprocess.run(
        ["docker", "version"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        timeout=10,
    )
    if docker_result.returncode != 0:
        raise SetupError(
            "Docker daemon is not running or docker is not on PATH. "
            "Start Docker and retry."
        )
If docker is not installed or not present on the system's PATH, calling subprocess.run(["docker", ...]) will raise a FileNotFoundError rather than returning a non-zero exit code. This will cause the preflight check to crash with an unhandled traceback. Checking if docker is on PATH using shutil.which and wrapping the execution in a try-except block ensures a clean SetupError is raised.
Suggested change:
    if shutil.which("docker") is None:
        raise SetupError(
            "docker is not on PATH. Install Docker and retry."
        )
    try:
        docker_result = subprocess.run(
            ["docker", "version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            timeout=10,
        )
    except Exception as e:
        raise SetupError(f"Failed to execute docker command: {e}")
    if docker_result.returncode != 0:
        raise SetupError(
            "Docker daemon is not running. Start Docker and retry."
        )
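A minimal, self-contained sketch of the check described in this comment. `SetupError` is defined locally as a stand-in for the project's class, and `check_docker` with its `which`/`run` parameters is a hypothetical helper, added only so the check can be exercised without Docker installed. Unlike the suggestion, it narrows the `except` clause to `OSError` and `subprocess.TimeoutExpired`, which is an assumption about which failures matter here.

```python
import shutil
import subprocess


class SetupError(RuntimeError):
    """Stand-in for the project's SetupError (assumed, not the real class)."""


def check_docker(which=shutil.which, run=subprocess.run, timeout=10):
    """Raise SetupError unless `docker version` runs and exits with 0."""
    if which("docker") is None:
        raise SetupError("docker is not on PATH. Install Docker and retry.")
    try:
        result = run(
            ["docker", "version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError covers FileNotFoundError and PermissionError;
        # TimeoutExpired covers a CLI that hangs past `timeout` seconds.
        raise SetupError(f"Failed to execute docker command: {exc}") from exc
    if result.returncode != 0:
        raise SetupError("Docker daemon is not running. Start Docker and retry.")
```

Chaining with `from exc` keeps the original traceback available for debugging while the user still sees a single `SetupError`.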
    with self.swebench_config_template.open() as _f:
        _tmpl = yaml.safe_load(_f)
    if not isinstance((_tmpl or {}).get("model", {}).get("model_kwargs"), dict):
        raise ValueError(
            f"swebench template {self.swebench_config_template} must have a "
            "'model.model_kwargs' dict; check the template structure."
        )
The validation of the swebench_config_template structure is fragile and can raise an AttributeError or TypeError if _tmpl is parsed as None (e.g., an empty file), or if "model" is not a dictionary (e.g., a string or None). Using a safer dictionary retrieval pattern prevents these potential runtime crashes.
Suggested change:
    with self.swebench_config_template.open() as _f:
        _tmpl = yaml.safe_load(_f) or {}
    model_cfg = _tmpl.get("model")
    if not isinstance(model_cfg, dict) or not isinstance(model_cfg.get("model_kwargs"), dict):
        raise ValueError(
            f"swebench template {self.swebench_config_template} must have a "
            "'model.model_kwargs' dict; check the template structure."
        )
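The same validation can be sketched as a standalone function over already-parsed data, which makes the failure cases easy to test without a YAML file. `validate_template` and its `source` parameter are hypothetical names; the extra `isinstance(tmpl, dict)` guard (for a template whose top level is a list or scalar) goes slightly beyond what the suggestion covers.

```python
def validate_template(tmpl, source="<template>"):
    """Return the 'model.model_kwargs' dict, or raise ValueError if absent."""
    tmpl = tmpl or {}  # yaml.safe_load returns None for an empty file
    model_cfg = tmpl.get("model") if isinstance(tmpl, dict) else None
    if not isinstance(model_cfg, dict) or not isinstance(
        model_cfg.get("model_kwargs"), dict
    ):
        raise ValueError(
            f"swebench template {source} must have a 'model.model_kwargs' dict; "
            "check the template structure."
        )
    return model_cfg["model_kwargs"]
```

An empty file, `model: null`, and `model: some-string` all end in the same `ValueError` rather than an `AttributeError`.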
    model_params = benchmark_config_dict.get("model_params", {})
    endpoints = benchmark_config_dict.get("endpoint_config", {}).get(
        "endpoints", []
    )
If endpoint_config or model_params is defined as empty or null in the YAML configuration, benchmark_config_dict.get(...) can return None. Calling .get() on None will raise an AttributeError. Using the or {} pattern ensures safe dictionary access.
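A short sketch of why the default argument does not help here: `dict.get` only uses its default when the key is missing, not when the key is present with a `None` value. `get_endpoints` is a hypothetical helper, and it also applies `or []` to the inner lookup, which is an extra precaution rather than part of the review comment.

```python
def get_endpoints(benchmark_config_dict):
    """Return the endpoint list, tolerating null or missing sections."""
    endpoint_cfg = benchmark_config_dict.get("endpoint_config") or {}
    return endpoint_cfg.get("endpoints") or []


# A YAML line such as `endpoint_config:` with no value parses to None,
# so the key exists and the .get() default is never consulted.
config = {"model_params": None, "endpoint_config": None}
assert config.get("endpoint_config", {}) is None
assert get_endpoints(config) == []
```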
Suggested change:
    model_params = benchmark_config_dict.get("model_params") or {}
    endpoint_cfg = benchmark_config_dict.get("endpoint_config") or {}
    endpoints = endpoint_cfg.get("endpoints", [])

    submitted = result.get("submitted_instances", 0)
    resolved = result.get("resolved_instances", 0)
If the SWE-bench evaluation results JSON contains null values for submitted_instances or resolved_instances, .get(..., 0) will return None instead of 0. This will subsequently cause a TypeError when calculating the resolved rate. Using or 0 ensures that any None values are safely defaulted to 0.
Suggested change:
    submitted = result.get("submitted_instances") or 0
    resolved = result.get("resolved_instances") or 0
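A sketch of the resolved-rate calculation with both guards in place: null counts default to 0, and a zero `submitted` count yields no score instead of a `ZeroDivisionError`. The function name `resolved_rate` is hypothetical; the `None` return for zero submissions follows the behavior the PR's tests describe.

```python
def resolved_rate(result):
    """Return resolved/submitted, or None when nothing was submitted."""
    submitted = result.get("submitted_instances") or 0
    resolved = result.get("resolved_instances") or 0
    if submitted == 0:
        return None  # no instances were graded, so there is no score
    return resolved / submitted
```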
Adds SWEBenchScorer (scorer_id="swe_bench_scorer") and a predefined SWEBench dataset that downloads from princeton-nlp/SWE-bench_Verified or SWE-bench_Lite on Hugging Face. The scorer patches a committed swebench_template.yaml with model name, endpoint URL, and all sampling parameters from model_params at runtime (no duplication into extras), then runs mini-swe-agent and grades predictions with the swebench evaluation harness, reporting resolved_instances / submitted_instances as the accuracy score.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
swebench.harness.run_evaluation writes its result JSON to its CWD. The eval subprocess was running with cwd=self.mini_swe_agent_dir, scattering files in the user's external venv directory. Change CWD to output_dir (report_dir/swe_bench_output/) so all outputs are self-contained under the report directory. Update the result file lookup accordingly and fix the test fixture to match.
Add examples/10_SWEBench_Example/accuracy/RUNBOOK.md with venv setup and smoke-test instructions, mirroring the VBench runbook pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
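The result lookup under `output_dir` might look roughly like the sketch below. `find_result_file` is a hypothetical name, and the fallback rule (pick the most recently modified `*.json`) is an assumption; the PR only states that a glob fallback exists when the exact file name is absent.

```python
from pathlib import Path


def find_result_file(output_dir, expected_name):
    """Locate the harness result JSON under output_dir.

    Tries the exact file name first, then falls back to the newest *.json.
    """
    output_dir = Path(output_dir)
    exact = output_dir / expected_name
    if exact.is_file():
        return exact
    candidates = sorted(output_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"No result JSON found in {output_dir}")
    return candidates[-1]
```

For the file to land there in the first place, the evaluation subprocess would be started with `subprocess.run(cmd, cwd=output_dir)`.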
Replace the external mini-swe-agent venv model with an in-repo uv subproject at examples/10_SWEBench_Example/accuracy/ (pyproject.toml + uv.lock), mirroring how VBenchScorer isolates vbench dependencies.
- Add accuracy/pyproject.toml pinning mini-swe-agent==2.3.0 and swebench==4.1.0 with package=false; generate uv.lock
- Rename _DEFAULT_MINI_SWE_AGENT_DIR/_MINI_SWE_AGENT_DIR_ENV to _DEFAULT_SWE_BENCH_PROJECT_PATH/_SWE_BENCH_PROJECT_PATH_ENV; default now points into the repo (examples/10_SWEBench_Example/accuracy)
- _run_subprocess now wraps commands with `uv run --project <path>` instead of manually activating a .venv; drop manual env patching
- Init check now validates pyproject.toml presence, not .venv/bin/python
- Update RUNBOOK to reflect `uv sync` setup; remove test-dev paths
- Update swe_bench_accuracy.yaml: remove hardcoded /home/user path, rename key to swe_bench_project_path
- Update tests: rename fixture mini_swe_dir → swe_bench_project, parameter mini_swe_agent_dir → swe_bench_project_path
- Exclude accuracy subproject uv.lock files from the large-file hook
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
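The command wrapping and the init check from this commit can be sketched together as follows. `uv_command` is a hypothetical helper, not the PR's `_run_subprocess`; only the `uv run --project <path>` prefix and the pyproject.toml presence check are taken from the commit message.

```python
from pathlib import Path


def uv_command(cmd, project_path):
    """Prefix cmd so it runs inside the uv project at project_path."""
    project = Path(project_path)
    if not (project / "pyproject.toml").is_file():
        raise FileNotFoundError(
            f"{project} has no pyproject.toml; point at the accuracy subproject."
        )
    return ["uv", "run", "--project", str(project), *cmd]
```

Because uv resolves the environment from the project path, the caller does not need to activate a virtual environment or patch `PATH` before invoking the result.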
Minimal end-to-end smoke config (5 perf samples, 1 accuracy instance) for validating the SWEBenchScorer pipeline without running a full 600-second benchmark. Uses a 1-row JSONL for the accuracy phase to avoid issuing 500 predefined-dataset requests to the endpoint before the scorer runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a multi-turn + SWE-bench accuracy smoke config (swe_bench_multiturn_smoke.yaml) and its minimal 2-conversation perf dataset. Confirmed end-to-end: multi-turn perf phase completes, then SWEBenchScorer runs mini-swe-agent + harness and returns a score without errors (exit 0). Also adds a comment to swe_bench_accuracy.yaml noting that the perf dataset can be replaced with a multi-turn dataset without other changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…phase
Adds Dataset.ACCURACY_ONLY class variable (default False). SWEBench sets it to True: problem statements sent directly to the model without an agent framework don't reflect real SWE-bench usage, so using the predefined dataset as a performance dataset is now rejected with InputValidationError. The check in _load_datasets() fires before create_loader() via a PREDEFINED lookup, so it gives a clear error rather than a confusing downstream failure.
Updates swe_bench_accuracy.yaml, swe_bench_accuracy_smoke.yaml, and swe_bench_multiturn_smoke.yaml to use an explicit JSONL for the perf phase.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
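A minimal sketch of the class-variable gate described above. The class bodies, the `PREDEFINED` registry contents, the `"swe_bench"` key, and the `check_perf_dataset` helper are all illustrative assumptions; only the `ACCURACY_ONLY` flag and the `InputValidationError` name come from the commit message.

```python
class InputValidationError(ValueError):
    """Stand-in for the project's InputValidationError (assumed)."""


class Dataset:
    ACCURACY_ONLY = False  # by default a dataset may serve the perf phase


class SWEBench(Dataset):
    ACCURACY_ONLY = True


# Hypothetical registry of predefined datasets, keyed by config name.
PREDEFINED = {"swe_bench": SWEBench}


def check_perf_dataset(name):
    """Reject accuracy-only predefined datasets before any loader is built."""
    dataset_cls = PREDEFINED.get(name)
    if dataset_cls is not None and dataset_cls.ACCURACY_ONLY:
        raise InputValidationError(
            f"Dataset '{name}' is accuracy-only and cannot be used as a "
            "performance dataset."
        )
```

Names that are not in the registry (for example, a path to a user-supplied JSONL file) pass through unchanged.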
- Replace two-branch subset ternary with _SWE_BENCH_HF_MAP dict lookup; validate subset in __init__ so unknown values (e.g. "full") raise ValueError at construction time rather than silently scoring against the wrong dataset.
- Hoist `import yaml` to module top level; declare pyyaml==6.0.3 in pyproject.toml (was already a de-facto transitive dep, now explicit).
- Add msgspec.json.decode(..., type=dict) on harness result file so a non-dict JSON response raises DecodeError instead of AttributeError.
- Validate swebench template schema at construction: raise ValueError if model.model_kwargs is missing, surfacing bad templates early.
- Add unit tests for all four fixes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
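The dict-lookup replacement for the ternary could look like the sketch below. The two Hugging Face dataset names come from the PR description and the `"verified"` key matches the default mentioned elsewhere in the PR; the `"lite"` key and the `resolve_hf_dataset` helper are assumptions.

```python
_SWE_BENCH_HF_MAP = {
    "verified": "princeton-nlp/SWE-bench_Verified",
    "lite": "princeton-nlp/SWE-bench_Lite",  # key name assumed
}


def resolve_hf_dataset(subset):
    """Map a subset name to its Hugging Face dataset, rejecting unknown names."""
    try:
        return _SWE_BENCH_HF_MAP[subset]
    except KeyError:
        valid = ", ".join(sorted(_SWE_BENCH_HF_MAP))
        raise ValueError(
            f"Unknown SWE-bench subset {subset!r}; expected one of: {valid}"
        ) from None
```

With a two-branch ternary, any value other than the first option silently falls into the `else` branch; the explicit lookup turns that case into an error.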
Scorer.preflight() class method hook (no-op by default) is called in _load_datasets for every accuracy scorer before the perf phase begins. SWEBenchScorer overrides it to verify:
1. uv is on PATH
2. mini-extra is runnable in the accuracy subproject (uv sync was run)
3. swebench is importable in the subproject
4. Docker daemon is running
All four raise SetupError with actionable messages so a misconfigured environment is caught upfront rather than after a potentially long performance run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
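The hook pattern itself can be sketched in a few lines. `SetupError` is a local stand-in, and `NeedsToolScorer`, its `REQUIRED_TOOL` value, and `run_preflights` are hypothetical names used only to show a no-op base hook being overridden and then called for every scorer class before any long-running phase.

```python
import shutil


class SetupError(RuntimeError):
    """Stand-in for the project's SetupError (assumed)."""


class Scorer:
    @classmethod
    def preflight(cls):
        """No-op by default; subclasses override to validate their environment."""


class NeedsToolScorer(Scorer):
    # Hypothetical tool name chosen so the check fails on any machine.
    REQUIRED_TOOL = "no-such-tool-for-preflight-demo"

    @classmethod
    def preflight(cls):
        if shutil.which(cls.REQUIRED_TOOL) is None:
            raise SetupError(f"{cls.REQUIRED_TOOL} is not on PATH.")


def run_preflights(scorer_classes):
    """Call every scorer's preflight; the first failure aborts the run."""
    for scorer_cls in scorer_classes:
        scorer_cls.preflight()
```

Making the hook a classmethod means it can run before any scorer instance exists, which is what allows it to fire ahead of the performance phase.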
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
num_instances=100, workers=10, max_eval_workers=10, subset="verified" reflect the intended accuracy evaluation target out of the box. Removes the None fallback for num_instances and cleans up redundant extras from YAML examples.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add four tests to match VBenchScorer coverage depth:
- subprocess non-zero exit code raises RuntimeError
- subprocess timeout raises RuntimeError
- result file found via glob fallback when exact name absent
- submitted_instances=0 guard returns None score
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
VBench does not have smoke variants; these files used dummy perf data and a 1-instance accuracy run that users can achieve via --extras num_instances=1 on the CLI. No other files reference them.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Scorer.preflight: collapse to a one-liner (matches available_scorers style)
- SWEBenchScorer class docstring: drop numbered step walkthrough and extras parameter table; keep the what and the uv-run isolation rationale
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove test_score_result_non_dict_raises_decode_error: it asserts on msgspec behavior, not SWEBenchScorer logic
- Merge subset pair into test_subset_maps_to_correct_hf_dataset_name with parametrize
- Merge slice pair into test_num_instances_produces_correct_slice with parametrize
20 test functions, 22 pytest cases (was 23 functions, 23 cases)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dir to absolute
SWEBenchScorer delegates entirely to mini-extra and never reads endpoint responses, so the accuracy endpoint phase was sending 500 samples through the endpoint for nothing. Add SKIP_ENDPOINT_PHASE class variable to Scorer (default False, True on SWEBenchScorer) and guard _build_phases() and the total_samples count with it. Also add external_sample_count() classmethod so scorers that skip the endpoint phase can still surface their sample count in the setup log. SWEBenchScorer returns num_instances from extras.
Fix a second bug where a relative report_dir (e.g. logs/foo launched from repo root) caused mini-extra to fail with FileNotFoundError on the patched config YAML because all derived paths were relative and mini-extra runs with cwd=output_dir. Resolving self.report_dir to absolute in SWEBenchScorer.__init__ fixes all derived paths at once.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
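A rough sketch of how the skip flag could gate the sample count. The class bodies, the `extras` argument to `external_sample_count`, and the `total_endpoint_samples` helper are assumptions for illustration; the PR does not show the real signatures.

```python
class Scorer:
    SKIP_ENDPOINT_PHASE = False

    @classmethod
    def external_sample_count(cls, extras):
        """Samples scored outside the endpoint phase; None if not applicable."""
        return None


class SWEBenchScorer(Scorer):
    SKIP_ENDPOINT_PHASE = True  # scoring happens in a subprocess instead

    @classmethod
    def external_sample_count(cls, extras):
        return extras.get("num_instances")


def total_endpoint_samples(accuracy_runs):
    """Sum dataset sizes only for scorers that read endpoint responses.

    accuracy_runs is a list of (scorer_cls, dataset_size) pairs.
    """
    return sum(
        size for scorer_cls, size in accuracy_runs
        if not scorer_cls.SKIP_ENDPOINT_PHASE
    )
```

The relative-path fix in the same commit amounts to calling `Path(report_dir).resolve()` once at construction, so every path derived from it stays valid after the subprocess changes its working directory.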
SWEBenchScorer never uses sample_index_map (it scores via mini-extra subprocess), but Scorer.__init__ unconditionally called _load_sample_index_map, which KeyErrors when the accuracy phase was skipped and 'swe_bench' was therefore absent from sample_idx_map.json.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Restore corrupted Apache 2.0 license header in scoring.py
- Move import yaml into sorted third-party block (ruff I001)
- Add pass body and clarify noqa comment on base Scorer.preflight
- Add type annotation and assert guards for sample_index_map (fixes mypy)
- Change zip(strict=False) to strict=True for accuracy_datasets/eval_configs
- Move _SelfContainedScorer from inside test method to module level; remove lazy imports
- Filter _-prefixed test-only scorers from TestScorerMethodSync registry check
- Add test_external_sample_count covering all branches of SWEBenchScorer.external_sample_count
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
SWE-bench is the accuracy evaluation for the multi-turn agentic workload, so its config, agent template, accuracy subproject, and runbook belong in 09_MultiTurn/ rather than a separate 10_SWEBench_Example/ folder.
- Rename examples/10_SWEBench_Example/ → examples/09_MultiTurn/ (4 files)
- Update _DEFAULT_SWE_BENCH_PROJECT_PATH and _DEFAULT_SWE_BENCH_TEMPLATE constants in scoring.py to point at the new location
- Update all self-referential paths in swe_bench_accuracy.yaml, RUNBOOK.md, and pyproject.toml
- Add SWE-bench Accuracy section to examples/09_MultiTurn/README.md
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…accuracy.yaml
dummy_1k.jsonl uses the column name text_input; the OpenAI adapter's ColumnFilter requires prompt. Without the parser remap the benchmark fails at dataset load with a KeyError on the prompt column.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
89af1d1 to 6016c01
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
The branch introduced blank lines that violated ruff I001 in both files. Aligning with the main-branch import layout so pre-commit passes.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>