Add support for auto populating system description and run metadata #327
Add support for auto populating system description and run metadata #327anandhu-eng wants to merge 24 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces a system information capture feature (sysinfo) using mlcflow to collect hardware, software, and serving configurations from multi-node environments, both as a standalone CLI command and integrated into benchmark finalization. It also fixes a bug in ColumnFilter where repeated calls mutated required_columns. The review feedback highlights several key issues: the default output filename in capture.py should be updated to match the design documentation and tests; capture_system_info needs to check new_env for the output path; there is a format discrepancy between the written run_metadata.json and the documented run_metadata.yml; and mlcflow should be moved to optional dependencies in pyproject.toml to align with its optional usage in the code.
| "pytz==2026.1.post1", | ||
| "urllib3==2.7.0", | ||
| # MLCFlow for system info | ||
| "mlcflow", |
There was a problem hiding this comment.
mlcflow has been added to the core dependencies list. However, comments in capture.py and execute.py state that mlcflow is an optional dependency only needed when system_info is configured. If it is indeed intended to be optional, it should be moved to [project.optional-dependencies] (e.g., under a sys-info extra) to avoid forcing all users to install it.
There was a problem hiding this comment.
I think it would be better to have the mlc-script dependency in the core list, as it is one of the main tasks, with the flexibility to opt out. That said, I am happy to move it to an optional dependency as well.
- Remove unused `import yaml` from benchmark/execute.py - Rename mlcflow → mlc-scripts in pyproject.toml and uv.lock; update capture.py comment and SetupError install hint accordingly - Update DESIGN.md: run_metadata.yml → run_metadata.json, mlperf-multi-node-system-info.json → system_desc.json - Fix test_capture.py assertions to match actual out_file_name (system_desc.json) and updated SetupError message (mlc-scripts) - Regenerate _full config templates (system_info field now visible) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkConfig field was named system_info but tests and the intended YAML API used sys_info_capture; rename throughout schema.py, execute.py, and the endpoint-url propagation validator - Fix capture_system_info return path to use MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH from new_env when present, falling back to output_path/system_desc.json - Update fake_capture stubs in test_sysinfo_command.py to accept run_metadata_path kwarg passed by the sysinfo CLI - Regenerate _full config templates after schema field rename Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fail with ExecutionError when MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH is not returned by mlcflow (replaces silent fallback to default path) - Test that missing ssh_ids in YAML raises ValidationError regardless of exclude_current_system value - Test that sys info failure (ExecutionError or unexpected exception) does not block results.json from being written in finalize_benchmark - Test unreachable node and node_config count mismatch scenarios - Replace internal hostname in test fixtures with generic user@10.0.0.1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nfig Renames the `sys_info_capture` YAML key to `system_info` in BenchmarkConfig for consistency with SysInfoFileConfig which already uses `system_info`. Updates all Python references, config templates, tests, and docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-run scripts/regenerate_templates.py after merging main (fbe543f drain timeout changes) into sysinfochanges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough
10 inline findings posted across the sysinfo / run-metadata changes. See the summary comment for the tiered breakdown and the design/performance/deployment impact analysis.
Review Council — Multi-AI Code ReviewReviewed by: Codex (gpt-5.4) + Claude | Depth: thorough Found 10 issues across 5 files. All line numbers verified against source at HEAD ( 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Change summary & impactWhat the PR does. Adds auto-population of MLPerf system description + run metadata. New pieces: a standalone Design. Additive and largely clean. Performance. No hot-path impact. All new code runs in finalization or in the standalone Deployment / backward-compatibility (the explicit focus). Existing
The one real deployment regression is #2: |
- Fix percentile key mismatch in _build_run_metadata: registry stores
str(float) keys ("99.0", "50.0") but lookups used integer strings
("99", "50"), causing all p50/p90/p95/p99 fields to be null
- Guard run_metadata.json write in try/except and move _build_run_metadata
inside the guard so results.json is always written first
- Use open() instead of Path.open() for run_metadata.json write to make
builtins.open patching work correctly in tests
- Move _build_run_metadata call to just before run_metadata.json write
so any future raise does not abort finalization before results.json
- Add zero-guard to tps_per_user division (concurrency > 0)
- Use datetime.now(UTC).isoformat() instead of datetime.now().isoformat()
for unambiguous UTC timestamps in run_metadata.json
- Remove _propagate_endpoint_url_to_sysinfo from BenchmarkConfig:
endpoints[0] is the load target not necessarily the serving node
- Add port range validation (1-65535) to serving_node field validator,
matching the check already present in _validate_ssh_ids
- Pin mlc-scripts==1.1.0 in pyproject.toml
- Add tests covering all the above fixes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Hi @arekay-nv @nvzhihanj, would you be able to review and merge this if everything looks good? |
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough
Found 9 issues across multiple files. See inline comments for details.
| if ctx.config.system_info is not None: | ||
| try: | ||
| # Local import: mlcflow is optional and only needed when system_info is configured. | ||
| from inference_endpoint.sys_info.capture import ( |
There was a problem hiding this comment.
[Claude] medium · design — Lazy import violates project rules and the SetupError is dead code.
# Local import: mlcflow is optional and only needed when system_info is configured.
from inference_endpoint.sys_info.capture import (
capture_system_info,
)AGENTS.md only permits lazy imports for three reasons: circular-import avoidance, optional deps via a top-level try/except, or security sandboxing. None apply here: mlc-scripts==1.1.0 is in the core dependencies array in pyproject.toml, so the ImportError guard in capture.py is unreachable. Move the import to the module top level.
| output_path = capture_system_info( | ||
| ctx.config.system_info, | ||
| output_dir=ctx.report_dir, | ||
| run_metadata_path=ctx.report_dir / "run_metadata.json", |
There was a problem hiding this comment.
[Claude] high · data-integrity — run_metadata.json path is passed to mlcflow unconditionally even if the file was never written.
run_metadata_path=ctx.report_dir / "run_metadata.json",If the open() write earlier in the try block raises OSError (disk full, permissions), the except Exception block logs the error and execution continues. The path handed to capture_system_info then points to a non-existent file. mlcflow will either error or produce silently incomplete output.
The standalone sysinfo CLI correctly guards with if candidate.exists(). Apply the same guard here:
metadata_path = ctx.report_dir / "run_metadata.json"
run_metadata_path=metadata_path if metadata_path.exists() else None,| ssh_ids: | ||
| - user@host | ||
| accelerator_backend: cuda | ||
| output_path: /tmp/sys_info |
There was a problem hiding this comment.
[Claude] medium · api-contract — Docstring example uses a non-existent field output_path.
system_info:
output_path: /tmp/sys_info # this field does not existSysInfoCaptureConfig has no output_path field — output location is controlled by the top-level report_dir field of SysInfoFileConfig. Because SysInfoCaptureConfig uses extra="forbid", a user following this example gets a ValidationError at runtime with no obvious hint about the correct field name. Remove output_path from the example and add a top-level report_dir key instead.
| """Mirror the tag-building logic from capture.py.""" | ||
| tags: list[str] = [ | ||
| "get-mlperf-multi-node-system-info", | ||
| f"_{cfg.accelerator_backend}", |
There was a problem hiding this comment.
[Claude] medium · testing — _build_tags test helper diverges from the real implementation for accelerator_backend="none".
Helper (here):
f"_{cfg.accelerator_backend}", # always appendedReal code (capture.py):
if config.accelerator_backend \!= "none":
tags.append(f"_{config.accelerator_backend}")For accelerator_backend="none", the helper produces get-mlperf-multi-node-system-info,_none while the real function produces get-mlperf-multi-node-system-info. Since no existing test case uses "none", this divergence is invisible. Add a test case for accelerator_backend="none" and fix the helper to match the implementation.
| "pytz==2026.1.post1", | ||
| "urllib3==2.7.0", | ||
| # mlc-scripts for system info | ||
| "mlc-scripts==1.1.0", |
There was a problem hiding this comment.
[Claude] medium · design — mlc-scripts in core deps inflates every install and transitively pulls in requests.
"mlc-scripts==1.1.0",This adds requests and giturlparse as transitive dependencies for every user of the benchmark tool, even those who never use system_info. AGENTS.md explicitly prohibits requests from production code. Consider moving mlc-scripts to an optional group (e.g. [project.optional-dependencies] sysinfo = ["mlc-scripts==1.1.0"]) so users opt in, which would also make the ImportError guard in capture.py live rather than dead code.
| } | ||
| } | ||
| } | ||
| fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_") |
There was a problem hiding this comment.
[Claude] low · bug — Temp file leak if yaml.dump raises after mkstemp.
fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_")
with os.fdopen(fd, "w") as fh:
yaml.dump(data, fh, default_flow_style=False)
return pathIf yaml.dump raises, the with block closes the fd but path is never returned, so node_config_tmp stays None in the caller and the finally block's os.unlink guard won't fire. Wrap with try/except to unlink on error:
fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_")
try:
with os.fdopen(fd, "w") as fh:
yaml.dump(data, fh, default_flow_style=False)
except Exception:
os.unlink(path)
raise
return path| "log-based framework detection when endpoint_url is unavailable." | ||
| ), | ||
| ) | ||
| log_path: str | None = Field( |
There was a problem hiding this comment.
[Claude] low · api-contract — No cross-field validation between serving_node and log_path.
log_path: str | None = Field(default=None, ...)A user who sets log_path but omits serving_node silently passes the path to mlcflow without a target host, likely producing confusing or incomplete output. A model_validator checking log_path is None or serving_node is not None would catch this at config-load time with a clear error message.
| try: | ||
| import mlc # noqa: PLC0415 | ||
| except ImportError as exc: | ||
| raise SetupError( |
There was a problem hiding this comment.
[Claude] low · error-handling — ImportError handler is unreachable dead code.
try:
import mlc # noqa: PLC0415
except ImportError as exc:
raise SetupError("mlc-scripts is required...") from exc # never reachedmlc-scripts==1.1.0 is in core dependencies, so ImportError can never be raised in a correctly installed environment. This creates a misleading signal that the feature is optional-install-gated. See also: related comment on pyproject.toml:77 about moving mlc-scripts to an optional group, which would make this guard meaningful.
| "inference_endpoint.commands.benchmark.execute._build_run_metadata", | ||
| return_value={}, | ||
| ), | ||
| patch("builtins.open", side_effect=OSError("disk full")), |
There was a problem hiding this comment.
[Claude] low · testing — Global builtins.open patch blocks all writes, including results.json.
patch("builtins.open", side_effect=OSError("disk full")),The test only asserts no exception propagates, but because results.json is also blocked, it doesn't verify the write-ordering invariant (results.json written before run_metadata.json). The sibling test test_results_json_written_even_if_metadata_write_fails does this correctly with selective patching. Consider renaming this test to test_disk_full_does_not_abort_finalize to clarify its narrower scope.
Review Council — Multi-AI Code ReviewReviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough Found 9 issues across 7 files. 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
|
What does this PR do?
This PR adds automated system info capture for MLPerf inference submissions. There are two ways to trigger it:
Standalone (inference-endpoint sysinfo from-config --config benchmark_config.yaml): collects hardware, software info and serving config from one or more remote nodes via SSH and writes a system_desc.json suitable for endpoints, independent of any benchmark run.
Integrated(if system_info block is included in the benchmark_config.yml when running the benchmark): runs automatically after a benchmark completes. In addition to collecting the same hardware info, it also extracts serving configuration (tensor/pipeline/expert parallelism, batch size, framework version) from the inference server's startup log or via HTTP probe, and patches those values into run_metadata.json.
Both paths SSH into each target node to run the get-mlperf-multi-node-system-info mlcflow script, merge the per-node results into a single JSON, and optionally validate the topology against a declared node_config (Eg; Prefill/Decode node groupings). Failures in the integrated path are non-blocking, results.json and run_metadata.json are always written first.
Type of change
Related issues
Testing
Checklist