Add support for auto populating system description and run metadata by anandhu-eng · Pull Request #327 · mlcommons/endpoints

anandhu-eng · 2026-05-28T21:40:11Z

What does this PR do?

This PR adds automated system info capture for MLPerf inference submissions. There are two ways to trigger it:

Standalone (inference-endpoint sysinfo from-config --config benchmark_config.yaml): collects hardware, software info and serving config from one or more remote nodes via SSH and writes a system_desc.json suitable for endpoints, independent of any benchmark run.
Integrated(if system_info block is included in the benchmark_config.yml when running the benchmark): runs automatically after a benchmark completes. In addition to collecting the same hardware info, it also extracts serving configuration (tensor/pipeline/expert parallelism, batch size, framework version) from the inference server's startup log or via HTTP probe, and patches those values into run_metadata.json.

Both paths SSH into each target node to run the get-mlperf-multi-node-system-info mlcflow script, merge the per-node results into a single JSON, and optionally validate the topology against a declared node_config (Eg; Prefill/Decode node groupings). Failures in the integrated path are non-blocking, results.json and run_metadata.json are always written first.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-28T21:40:19Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces a system information capture feature (sysinfo) using mlcflow to collect hardware, software, and serving configurations from multi-node environments, both as a standalone CLI command and integrated into benchmark finalization. It also fixes a bug in ColumnFilter where repeated calls mutated required_columns. The review feedback highlights several key issues: the default output filename in capture.py should be updated to match the design documentation and tests; capture_system_info needs to check new_env for the output path; there is a format discrepancy between the written run_metadata.json and the documented run_metadata.yml; and mlcflow should be moved to optional dependencies in pyproject.toml to align with its optional usage in the code.

gemini-code-assist · 2026-05-28T21:41:56Z

    "pytz==2026.1.post1",
    "urllib3==2.7.0",
+    # MLCFlow for system info
+    "mlcflow",


mlcflow has been added to the core dependencies list. However, comments in capture.py and execute.py state that mlcflow is an optional dependency only needed when system_info is configured. If it is indeed intended to be optional, it should be moved to [project.optional-dependencies] (e.g., under a sys-info extra) to avoid forcing all users to install it.

I think it would be better to have the mlc-script dependency in the core list, as it is one of the main tasks, with the flexibility to opt out. That said, I am happy to move it to an optional dependency as well.

- Remove unused `import yaml` from benchmark/execute.py - Rename mlcflow → mlc-scripts in pyproject.toml and uv.lock; update capture.py comment and SetupError install hint accordingly - Update DESIGN.md: run_metadata.yml → run_metadata.json, mlperf-multi-node-system-info.json → system_desc.json - Fix test_capture.py assertions to match actual out_file_name (system_desc.json) and updated SetupError message (mlc-scripts) - Regenerate _full config templates (system_info field now visible) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- BenchmarkConfig field was named system_info but tests and the intended YAML API used sys_info_capture; rename throughout schema.py, execute.py, and the endpoint-url propagation validator - Fix capture_system_info return path to use MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH from new_env when present, falling back to output_path/system_desc.json - Update fake_capture stubs in test_sysinfo_command.py to accept run_metadata_path kwarg passed by the sysinfo CLI - Regenerate _full config templates after schema field rename Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Fail with ExecutionError when MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH is not returned by mlcflow (replaces silent fallback to default path) - Test that missing ssh_ids in YAML raises ValidationError regardless of exclude_current_system value - Test that sys info failure (ExecutionError or unexpected exception) does not block results.json from being written in finalize_benchmark - Test unreachable node and node_config count mismatch scenarios - Replace internal hostname in test fixtures with generic user@10.0.0.1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nfig Renames the `sys_info_capture` YAML key to `system_info` in BenchmarkConfig for consistency with SysInfoFileConfig which already uses `system_info`. Updates all Python references, config templates, tests, and docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Re-run scripts/regenerate_templates.py after merging main (fbe543f drain timeout changes) into sysinfochanges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

10 inline findings posted across the sysinfo / run-metadata changes. See the summary comment for the tiered breakdown and the design/performance/deployment impact analysis.

arekay-nv · 2026-06-04T14:08:52Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

Found 10 issues across 5 files. All line numbers verified against source at HEAD (8d99b66); two topics already raised by existing bots (mlcflow core-vs-extra, run_metadata JSON/YAML doc mismatch) were de-duplicated.

🔴 Must Fix (high)

#	File	Line	Category	Reviewer(s)	Summary
1	`commands/benchmark/execute.py`	1100	data-integrity	Both	`_pct` looks up `"99"/"95"/"90"/"50"` but registry keys are float-strings (`"99.0"`…); `ttft` + all `p50/p90/p95/p99` fields write `null`. Only `*_p999` works.
2	`pyproject.toml`	77	design	Claude	`mlc-scripts` added with no `==` pin — violates AGENTS.md dependency rule.

🟡 Should Fix (medium)

#	File	Line	Category	Reviewer(s)	Summary
3	`commands/benchmark/execute.py`	995	error-handling	Claude	`run_metadata.json` write is unguarded (unlike `results.json`); a write failure aborts an otherwise-complete benchmark.
4	`config/schema.py`	707	api-contract	Codex	`serving_node` skips the `1..65535` port check that `ssh_ids` enforces → bad port deferred to runtime.
5	`config/schema.py`	772	api-contract	Codex	`SysInfoFileConfig` doesn't propagate `endpoint_url`, so shared-config `sysinfo from-config` silently skips the serving-framework probe.
6	`tests/unit/commands/test_benchmark_finalization.py`	87	testing	Claude	Tests stub `_build_run_metadata` → the real builder (and bug #1) is never exercised.

🔵 Consider (low)

#	File	Line	Category	Reviewer(s)	Summary
7	`commands/benchmark/execute.py`	1086	data-integrity	Codex	`run_date` is naive local time — ambiguous for cross-node submission metadata; use UTC.
8	`commands/benchmark/execute.py`	903	bug	Claude	`_build_run_metadata` runs before `results.json` is written; a raise aborts before artifacts are saved.
9	`commands/benchmark/execute.py`	1080	bug	Codex	`system_tps / concurrency` has no zero-guard (safe today via validation; defensive only).
10	`config/schema.py`	976	api-contract	Codex	`model_copy(update=...)` bypasses the `endpoint_url` scheme validator.

Change summary & impact

What the PR does. Adds auto-population of MLPerf system description + run metadata. New pieces: a standalone sysinfo from-config command (commands/sysinfo/), a sys_info/capture.py wrapper around the external get-mlperf-multi-node-system-info mlcflow script, a system_info config block (SysInfoCaptureConfig / SysInfoFileConfig in schema.py), and a run_metadata.json emitter in benchmark finalization recording latency/throughput stats. Also includes a genuine bug fix in dataset_manager/transforms.py (columns_to_keep = list(self.required_columns) — prevents the later += from mutating the shared required_columns list across calls).

Design. Additive and largely clean. system_info defaults to None; the discriminated union and existing validators are unchanged; capture is fully gated behind system_info is not None, and a missing mlcflow raises SetupError from the lazily-imported mlc rather than crashing. The main design debts are the percentile-key contract mismatch (#1) and the duplicated-but-divergent validation between serving_node and ssh_ids (#4).

Performance. No hot-path impact. All new code runs in finalization or in the standalone sysinfo command — nothing touches load_generator/, endpoint_client/, or the ZMQ transport. No new hot-path serialization or async suspends.

Deployment / backward-compatibility (the explicit focus). Existing benchmark (offline/online/from-config) and probe workflows are preserved:

system_info is optional and defaults to None; regenerated *_template_full.yaml only append system_info: null.
No new required CLI args, no schema defaults changed, no discriminated-union change → existing YAMLs and CLI invocations keep working.
sysinfo capture is gated and failure-tolerant: ExecutionError and generic exceptions are caught and logged, and results.json/run_metadata.json are written before capture, so a missing/failing mlcflow never aborts a benchmark.

The one real deployment regression is #2: mlc-scripts lands in the core dependencies list unpinned, so every install now pulls the full mlcflow stack (even users who never run sysinfo) at an unpinned version. That widens the install surface and breaks reproducibility for existing consumers — worth resolving before merge. Functionally, #1 means the headline run_metadata.json latency percentiles ship as null today, and #6 explains why CI doesn't catch it.

- Fix percentile key mismatch in _build_run_metadata: registry stores str(float) keys ("99.0", "50.0") but lookups used integer strings ("99", "50"), causing all p50/p90/p95/p99 fields to be null - Guard run_metadata.json write in try/except and move _build_run_metadata inside the guard so results.json is always written first - Use open() instead of Path.open() for run_metadata.json write to make builtins.open patching work correctly in tests - Move _build_run_metadata call to just before run_metadata.json write so any future raise does not abort finalization before results.json - Add zero-guard to tps_per_user division (concurrency > 0) - Use datetime.now(UTC).isoformat() instead of datetime.now().isoformat() for unambiguous UTC timestamps in run_metadata.json - Remove _propagate_endpoint_url_to_sysinfo from BenchmarkConfig: endpoints[0] is the load target not necessarily the serving node - Add port range validation (1-65535) to serving_node field validator, matching the check already present in _validate_ssh_ids - Pin mlc-scripts==1.1.0 in pyproject.toml - Add tests covering all the above fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

anandhu-eng · 2026-06-08T14:44:27Z

Hi @arekay-nv @nvzhihanj, would you be able to review and merge this if everything looks good?

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough

Found 9 issues across multiple files. See inline comments for details.

arekay-nv · 2026-06-08T16:00:36Z

+    if ctx.config.system_info is not None:
+        try:
+            # Local import: mlcflow is optional and only needed when system_info is configured.
+            from inference_endpoint.sys_info.capture import (


[Claude] medium · design — Lazy import violates project rules and the SetupError is dead code.

# Local import: mlcflow is optional and only needed when system_info is configured. from inference_endpoint.sys_info.capture import ( capture_system_info, )

AGENTS.md only permits lazy imports for three reasons: circular-import avoidance, optional deps via a top-level try/except, or security sandboxing. None apply here: mlc-scripts==1.1.0 is in the core dependencies array in pyproject.toml, so the ImportError guard in capture.py is unreachable. Move the import to the module top level.

arekay-nv · 2026-06-08T16:00:36Z

+            output_path = capture_system_info(
+                ctx.config.system_info,
+                output_dir=ctx.report_dir,
+                run_metadata_path=ctx.report_dir / "run_metadata.json",


[Claude] high · data-integrity — run_metadata.json path is passed to mlcflow unconditionally even if the file was never written.

run_metadata_path=ctx.report_dir / "run_metadata.json",

If the open() write earlier in the try block raises OSError (disk full, permissions), the except Exception block logs the error and execution continues. The path handed to capture_system_info then points to a non-existent file. mlcflow will either error or produce silently incomplete output.

The standalone sysinfo CLI correctly guards with if candidate.exists(). Apply the same guard here:

metadata_path = ctx.report_dir / "run_metadata.json" run_metadata_path=metadata_path if metadata_path.exists() else None,

arekay-nv · 2026-06-08T16:00:36Z

+          ssh_ids:
+            - user@host
+          accelerator_backend: cuda
+          output_path: /tmp/sys_info


[Claude] medium · api-contract — Docstring example uses a non-existent field output_path.

system_info: output_path: /tmp/sys_info # this field does not exist

SysInfoCaptureConfig has no output_path field — output location is controlled by the top-level report_dir field of SysInfoFileConfig. Because SysInfoCaptureConfig uses extra="forbid", a user following this example gets a ValidationError at runtime with no obvious hint about the correct field name. Remove output_path from the example and add a top-level report_dir key instead.

arekay-nv · 2026-06-08T16:00:36Z

+    """Mirror the tag-building logic from capture.py."""
+    tags: list[str] = [
+        "get-mlperf-multi-node-system-info",
+        f"_{cfg.accelerator_backend}",


[Claude] medium · testing — _build_tags test helper diverges from the real implementation for accelerator_backend="none".

Helper (here):

f"_{cfg.accelerator_backend}", # always appended

Real code (capture.py):

if config.accelerator_backend \!= "none": tags.append(f"_{config.accelerator_backend}")

For accelerator_backend="none", the helper produces get-mlperf-multi-node-system-info,_none while the real function produces get-mlperf-multi-node-system-info. Since no existing test case uses "none", this divergence is invisible. Add a test case for accelerator_backend="none" and fix the helper to match the implementation.

arekay-nv · 2026-06-08T16:00:36Z

    "pytz==2026.1.post1",
    "urllib3==2.7.0",
+    # mlc-scripts for system info
+    "mlc-scripts==1.1.0",


[Claude] medium · design — mlc-scripts in core deps inflates every install and transitively pulls in requests.

"mlc-scripts==1.1.0",

This adds requests and giturlparse as transitive dependencies for every user of the benchmark tool, even those who never use system_info. AGENTS.md explicitly prohibits requests from production code. Consider moving mlc-scripts to an optional group (e.g. [project.optional-dependencies] sysinfo = ["mlc-scripts==1.1.0"]) so users opt in, which would also make the ImportError guard in capture.py live rather than dead code.

arekay-nv · 2026-06-08T16:00:36Z

+            }
+        }
+    }
+    fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_")


[Claude] low · bug — Temp file leak if yaml.dump raises after mkstemp.

fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_") with os.fdopen(fd, "w") as fh: yaml.dump(data, fh, default_flow_style=False) return path

If yaml.dump raises, the with block closes the fd but path is never returned, so node_config_tmp stays None in the caller and the finally block's os.unlink guard won't fire. Wrap with try/except to unlink on error:

fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_") try: with os.fdopen(fd, "w") as fh: yaml.dump(data, fh, default_flow_style=False) except Exception: os.unlink(path) raise return path

arekay-nv · 2026-06-08T16:00:36Z

+            "log-based framework detection when endpoint_url is unavailable."
+        ),
+    )
+    log_path: str | None = Field(


[Claude] low · api-contract — No cross-field validation between serving_node and log_path.

log_path: str | None = Field(default=None, ...)

A user who sets log_path but omits serving_node silently passes the path to mlcflow without a target host, likely producing confusing or incomplete output. A model_validator checking log_path is None or serving_node is not None would catch this at config-load time with a clear error message.

arekay-nv · 2026-06-08T16:00:36Z

+    try:
+        import mlc  # noqa: PLC0415
+    except ImportError as exc:
+        raise SetupError(


[Claude] low · error-handling — ImportError handler is unreachable dead code.

try: import mlc # noqa: PLC0415 except ImportError as exc: raise SetupError("mlc-scripts is required...") from exc # never reached

mlc-scripts==1.1.0 is in core dependencies, so ImportError can never be raised in a correctly installed environment. This creates a misleading signal that the feature is optional-install-gated. See also: related comment on pyproject.toml:77 about moving mlc-scripts to an optional group, which would make this guard meaningful.

arekay-nv · 2026-06-08T16:00:36Z

+                "inference_endpoint.commands.benchmark.execute._build_run_metadata",
+                return_value={},
+            ),
+            patch("builtins.open", side_effect=OSError("disk full")),


[Claude] low · testing — Global builtins.open patch blocks all writes, including results.json.

patch("builtins.open", side_effect=OSError("disk full")),

The test only asserts no exception propagates, but because results.json is also blocked, it doesn't verify the write-ordering invariant (results.json written before run_metadata.json). The sibling test test_results_json_written_even_if_metadata_write_fails does this correctly with selective patching. Consider renaming this test to test_disk_full_does_not_abort_finalize to clarify its narrower scope.

arekay-nv · 2026-06-08T16:00:57Z

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough

Found 9 issues across 7 files.

🔴 Must Fix (high)

#	File	Line	Category	Summary
1	`src/inference_endpoint/commands/benchmark/execute.py`	1020	data-integrity	`run_metadata.json` path passed to mlcflow unconditionally even if the file write failed

🟡 Should Fix (medium)

#	File	Line	Category	Summary
2	`src/inference_endpoint/commands/benchmark/execute.py`	1013	design	Lazy import violates AGENTS.md rules; `ImportError` guard is dead code since `mlc-scripts` is in core deps
3	`src/inference_endpoint/commands/sysinfo/cli.py`	50	api-contract	Docstring example uses `output_path` which doesn't exist on `SysInfoCaptureConfig` — causes `ValidationError` at runtime
4	`tests/unit/sys_info/test_capture.py`	148	testing	`_build_tags` helper always appends `accelerator_backend` tag; real code skips it for `"none"`, gap is untested
5	`pyproject.toml`	77	design	`mlc-scripts` in core deps pulls in transitive `requests` (violates AGENTS.md) and inflates every install

🔵 Consider (low)

#	File	Line	Category	Summary
6	`src/inference_endpoint/sys_info/capture.py`	56	bug	Temp file leak if `yaml.dump` raises after `mkstemp` (fd closed but path never unlinked)
7	`src/inference_endpoint/config/schema.py`	717	api-contract	No cross-field validation: `log_path` without `serving_node` silently passes an incomplete config to mlcflow
8	`src/inference_endpoint/sys_info/capture.py`	78	error-handling	`ImportError` handler is unreachable dead code (related to #2 and #5)
9	`tests/unit/commands/test_benchmark_finalization.py`	90	testing	Global `builtins.open` patch also blocks `results.json`; test scope is narrower than its name implies

anandhu-eng and others added 6 commits May 14, 2026 16:54

Initial modifications for sys info

0d89afc

Add A40 (Prefill) + H100 (Decode) sysinfo example config

1769b6f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

further modification

8bb0649

update doc, add per run datadictionary capture

bb9b6cc

fixes for run_metadata

5d5ca3c

Merge branch 'main' into sysinfochanges

26138e3

anandhu-eng requested a review from a team May 28, 2026 21:40

anandhu-eng marked this pull request as draft May 28, 2026 21:40

github-actions Bot requested review from arekay-nv and nvzhihanj May 28, 2026 21:40

github-code-quality Bot found potential problems May 28, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py Fixed

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

anandhu-eng and others added 4 commits May 29, 2026 16:06

fix: sort imports in test_sysinfo_command.py (ruff)

9b88a81

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

anandhu-eng marked this pull request as ready for review June 2, 2026 16:13

anandhu-eng and others added 9 commits June 2, 2026 21:43

Merge branch 'main' into sysinfochanges

e134a90

Merge branch 'main' into sysinfochanges

7b8e276

Merge branch 'main' into sysinfochanges

1db9d01

fix: regenerate config templates after merge from main

db721b0

Re-run scripts/regenerate_templates.py after merging main (fbe543f drain timeout changes) into sysinfochanges. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

regenerated templates

435e0ff

Delete examples/sysinfo_a40_prefill_h100_decode.yaml

f6777b3

Address cve-2026-34993 (#333)

19b20bb

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Merge branch 'main' into sysinfochanges

8d99b66

arekay-nv reviewed Jun 4, 2026

View reviewed changes

viraatc force-pushed the main branch from 19b20bb to 28f5ac1 Compare June 4, 2026 23:25

attafosu reviewed Jun 5, 2026

View reviewed changes

Comment thread docs/commands/DESIGN.md Outdated

anandhu-eng and others added 4 commits June 5, 2026 19:15

improve readme + force use report dir for sysinfo output

979d984

pre commit

4e1121b

Merge branch 'main' into sysinfochanges

d1539be

Merge branch 'main' into sysinfochanges

60989d6

arekay-nv reviewed Jun 8, 2026

View reviewed changes

Conversation

anandhu-eng commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv commented Jun 4, 2026

Review Council — Multi-AI Code Review

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Change summary & impact

Uh oh!

Uh oh!

anandhu-eng commented Jun 8, 2026

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arekay-nv commented Jun 8, 2026

Review Council — Multi-AI Code Review

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Uh oh!

Reviewers

Assignees

Labels

anandhu-eng commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading