Skip to content

Add support for auto populating system description and run metadata #327

Open
anandhu-eng wants to merge 24 commits into
mainfrom
sysinfochanges
Open

Add support for auto populating system description and run metadata #327
anandhu-eng wants to merge 24 commits into
mainfrom
sysinfochanges

Conversation

@anandhu-eng

@anandhu-eng anandhu-eng commented May 28, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR adds automated system info capture for MLPerf inference submissions. There are two ways to trigger it:

  1. Standalone (inference-endpoint sysinfo from-config --config benchmark_config.yaml): collects hardware, software info and serving config from one or more remote nodes via SSH and writes a system_desc.json suitable for endpoints, independent of any benchmark run.

  2. Integrated(if system_info block is included in the benchmark_config.yml when running the benchmark): runs automatically after a benchmark completes. In addition to collecting the same hardware info, it also extracts serving configuration (tensor/pipeline/expert parallelism, batch size, framework version) from the inference server's startup log or via HTTP probe, and patches those values into run_metadata.json.

Both paths SSH into each target node to run the get-mlperf-multi-node-system-info mlcflow script, merge the per-node results into a single JSON, and optionally validate the topology against a declared node_config (Eg; Prefill/Decode node groupings). Failures in the integrated path are non-blocking, results.json and run_metadata.json are always written first.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@anandhu-eng anandhu-eng requested a review from a team May 28, 2026 21:40
@anandhu-eng anandhu-eng marked this pull request as draft May 28, 2026 21:40
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj May 28, 2026 21:40
Comment thread src/inference_endpoint/commands/benchmark/execute.py Fixed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a system information capture feature (sysinfo) using mlcflow to collect hardware, software, and serving configurations from multi-node environments, both as a standalone CLI command and integrated into benchmark finalization. It also fixes a bug in ColumnFilter where repeated calls mutated required_columns. The review feedback highlights several key issues: the default output filename in capture.py should be updated to match the design documentation and tests; capture_system_info needs to check new_env for the output path; there is a format discrepancy between the written run_metadata.json and the documented run_metadata.yml; and mlcflow should be moved to optional dependencies in pyproject.toml to align with its optional usage in the code.

Comment thread src/inference_endpoint/sys_info/capture.py
Comment thread src/inference_endpoint/sys_info/capture.py
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread pyproject.toml Outdated
"pytz==2026.1.post1",
"urllib3==2.7.0",
# MLCFlow for system info
"mlcflow",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

mlcflow has been added to the core dependencies list. However, comments in capture.py and execute.py state that mlcflow is an optional dependency only needed when system_info is configured. If it is indeed intended to be optional, it should be moved to [project.optional-dependencies] (e.g., under a sys-info extra) to avoid forcing all users to install it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to have the mlc-script dependency in the core list, as it is one of the main tasks, with the flexibility to opt out. That said, I am happy to move it to an optional dependency as well.

anandhu-eng and others added 4 commits May 29, 2026 16:06
- Remove unused `import yaml` from benchmark/execute.py
- Rename mlcflow → mlc-scripts in pyproject.toml and uv.lock; update
  capture.py comment and SetupError install hint accordingly
- Update DESIGN.md: run_metadata.yml → run_metadata.json,
  mlperf-multi-node-system-info.json → system_desc.json
- Fix test_capture.py assertions to match actual out_file_name
  (system_desc.json) and updated SetupError message (mlc-scripts)
- Regenerate _full config templates (system_info field now visible)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkConfig field was named system_info but tests and the
  intended YAML API used sys_info_capture; rename throughout schema.py,
  execute.py, and the endpoint-url propagation validator
- Fix capture_system_info return path to use MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH
  from new_env when present, falling back to output_path/system_desc.json
- Update fake_capture stubs in test_sysinfo_command.py to accept
  run_metadata_path kwarg passed by the sysinfo CLI
- Regenerate _full config templates after schema field rename

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fail with ExecutionError when MLC_MULTI_NODE_SYSTEM_INFO_FILE_PATH is
  not returned by mlcflow (replaces silent fallback to default path)
- Test that missing ssh_ids in YAML raises ValidationError regardless of
  exclude_current_system value
- Test that sys info failure (ExecutionError or unexpected exception)
  does not block results.json from being written in finalize_benchmark
- Test unreachable node and node_config count mismatch scenarios
- Replace internal hostname in test fixtures with generic user@10.0.0.1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@anandhu-eng anandhu-eng marked this pull request as ready for review June 2, 2026 16:13
anandhu-eng and others added 9 commits June 2, 2026 21:43
…nfig

Renames the `sys_info_capture` YAML key to `system_info` in BenchmarkConfig
for consistency with SysInfoFileConfig which already uses `system_info`.
Updates all Python references, config templates, tests, and docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-run scripts/regenerate_templates.py after merging main
(fbe543f drain timeout changes) into sysinfochanges.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

@arekay-nv arekay-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

10 inline findings posted across the sysinfo / run-metadata changes. See the summary comment for the tiered breakdown and the design/performance/deployment impact analysis.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread pyproject.toml Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/config/schema.py
Comment thread src/inference_endpoint/config/schema.py
Comment thread tests/unit/commands/test_benchmark_finalization.py
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py
Comment thread src/inference_endpoint/config/schema.py Outdated
@arekay-nv

Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.4) + Claude | Depth: thorough

Found 10 issues across 5 files. All line numbers verified against source at HEAD (8d99b66); two topics already raised by existing bots (mlcflow core-vs-extra, run_metadata JSON/YAML doc mismatch) were de-duplicated.

🔴 Must Fix (high)

# File Line Category Reviewer(s) Summary
1 commands/benchmark/execute.py 1100 data-integrity Both _pct looks up "99"/"95"/"90"/"50" but registry keys are float-strings ("99.0"…); ttft + all p50/p90/p95/p99 fields write null. Only *_p999 works.
2 pyproject.toml 77 design Claude mlc-scripts added with no == pin — violates AGENTS.md dependency rule.

🟡 Should Fix (medium)

# File Line Category Reviewer(s) Summary
3 commands/benchmark/execute.py 995 error-handling Claude run_metadata.json write is unguarded (unlike results.json); a write failure aborts an otherwise-complete benchmark.
4 config/schema.py 707 api-contract Codex serving_node skips the 1..65535 port check that ssh_ids enforces → bad port deferred to runtime.
5 config/schema.py 772 api-contract Codex SysInfoFileConfig doesn't propagate endpoint_url, so shared-config sysinfo from-config silently skips the serving-framework probe.
6 tests/unit/commands/test_benchmark_finalization.py 87 testing Claude Tests stub _build_run_metadata → the real builder (and bug #1) is never exercised.

🔵 Consider (low)

# File Line Category Reviewer(s) Summary
7 commands/benchmark/execute.py 1086 data-integrity Codex run_date is naive local time — ambiguous for cross-node submission metadata; use UTC.
8 commands/benchmark/execute.py 903 bug Claude _build_run_metadata runs before results.json is written; a raise aborts before artifacts are saved.
9 commands/benchmark/execute.py 1080 bug Codex system_tps / concurrency has no zero-guard (safe today via validation; defensive only).
10 config/schema.py 976 api-contract Codex model_copy(update=...) bypasses the endpoint_url scheme validator.

Change summary & impact

What the PR does. Adds auto-population of MLPerf system description + run metadata. New pieces: a standalone sysinfo from-config command (commands/sysinfo/), a sys_info/capture.py wrapper around the external get-mlperf-multi-node-system-info mlcflow script, a system_info config block (SysInfoCaptureConfig / SysInfoFileConfig in schema.py), and a run_metadata.json emitter in benchmark finalization recording latency/throughput stats. Also includes a genuine bug fix in dataset_manager/transforms.py (columns_to_keep = list(self.required_columns) — prevents the later += from mutating the shared required_columns list across calls).

Design. Additive and largely clean. system_info defaults to None; the discriminated union and existing validators are unchanged; capture is fully gated behind system_info is not None, and a missing mlcflow raises SetupError from the lazily-imported mlc rather than crashing. The main design debts are the percentile-key contract mismatch (#1) and the duplicated-but-divergent validation between serving_node and ssh_ids (#4).

Performance. No hot-path impact. All new code runs in finalization or in the standalone sysinfo command — nothing touches load_generator/, endpoint_client/, or the ZMQ transport. No new hot-path serialization or async suspends.

Deployment / backward-compatibility (the explicit focus). Existing benchmark (offline/online/from-config) and probe workflows are preserved:

  • system_info is optional and defaults to None; regenerated *_template_full.yaml only append system_info: null.
  • No new required CLI args, no schema defaults changed, no discriminated-union change → existing YAMLs and CLI invocations keep working.
  • sysinfo capture is gated and failure-tolerant: ExecutionError and generic exceptions are caught and logged, and results.json/run_metadata.json are written before capture, so a missing/failing mlcflow never aborts a benchmark.

The one real deployment regression is #2: mlc-scripts lands in the core dependencies list unpinned, so every install now pulls the full mlcflow stack (even users who never run sysinfo) at an unpinned version. That widens the install surface and breaks reproducibility for existing consumers — worth resolving before merge. Functionally, #1 means the headline run_metadata.json latency percentiles ship as null today, and #6 explains why CI doesn't catch it.

- Fix percentile key mismatch in _build_run_metadata: registry stores
  str(float) keys ("99.0", "50.0") but lookups used integer strings
  ("99", "50"), causing all p50/p90/p95/p99 fields to be null

- Guard run_metadata.json write in try/except and move _build_run_metadata
  inside the guard so results.json is always written first

- Use open() instead of Path.open() for run_metadata.json write to make
  builtins.open patching work correctly in tests

- Move _build_run_metadata call to just before run_metadata.json write
  so any future raise does not abort finalization before results.json

- Add zero-guard to tps_per_user division (concurrency > 0)

- Use datetime.now(UTC).isoformat() instead of datetime.now().isoformat()
  for unambiguous UTC timestamps in run_metadata.json

- Remove _propagate_endpoint_url_to_sysinfo from BenchmarkConfig:
  endpoints[0] is the load target not necessarily the serving node

- Add port range validation (1-65535) to serving_node field validator,
  matching the check already present in _validate_ssh_ids

- Pin mlc-scripts==1.1.0 in pyproject.toml

- Add tests covering all the above fixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread docs/commands/DESIGN.md Outdated
@anandhu-eng

Copy link
Copy Markdown
Contributor Author

Hi @arekay-nv @nvzhihanj, would you be able to review and merge this if everything looks good?

@arekay-nv arekay-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough

Found 9 issues across multiple files. See inline comments for details.

if ctx.config.system_info is not None:
try:
# Local import: mlcflow is optional and only needed when system_info is configured.
from inference_endpoint.sys_info.capture import (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · design — Lazy import violates project rules and the SetupError is dead code.

# Local import: mlcflow is optional and only needed when system_info is configured.
from inference_endpoint.sys_info.capture import (
    capture_system_info,
)

AGENTS.md only permits lazy imports for three reasons: circular-import avoidance, optional deps via a top-level try/except, or security sandboxing. None apply here: mlc-scripts==1.1.0 is in the core dependencies array in pyproject.toml, so the ImportError guard in capture.py is unreachable. Move the import to the module top level.

output_path = capture_system_info(
ctx.config.system_info,
output_dir=ctx.report_dir,
run_metadata_path=ctx.report_dir / "run_metadata.json",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high · data-integrityrun_metadata.json path is passed to mlcflow unconditionally even if the file was never written.

run_metadata_path=ctx.report_dir / "run_metadata.json",

If the open() write earlier in the try block raises OSError (disk full, permissions), the except Exception block logs the error and execution continues. The path handed to capture_system_info then points to a non-existent file. mlcflow will either error or produce silently incomplete output.

The standalone sysinfo CLI correctly guards with if candidate.exists(). Apply the same guard here:

metadata_path = ctx.report_dir / "run_metadata.json"
run_metadata_path=metadata_path if metadata_path.exists() else None,

ssh_ids:
- user@host
accelerator_backend: cuda
output_path: /tmp/sys_info

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · api-contract — Docstring example uses a non-existent field output_path.

system_info:
  output_path: /tmp/sys_info   # this field does not exist

SysInfoCaptureConfig has no output_path field — output location is controlled by the top-level report_dir field of SysInfoFileConfig. Because SysInfoCaptureConfig uses extra="forbid", a user following this example gets a ValidationError at runtime with no obvious hint about the correct field name. Remove output_path from the example and add a top-level report_dir key instead.

"""Mirror the tag-building logic from capture.py."""
tags: list[str] = [
"get-mlperf-multi-node-system-info",
f"_{cfg.accelerator_backend}",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · testing_build_tags test helper diverges from the real implementation for accelerator_backend="none".

Helper (here):

f"_{cfg.accelerator_backend}",   # always appended

Real code (capture.py):

if config.accelerator_backend \!= "none":
    tags.append(f"_{config.accelerator_backend}")

For accelerator_backend="none", the helper produces get-mlperf-multi-node-system-info,_none while the real function produces get-mlperf-multi-node-system-info. Since no existing test case uses "none", this divergence is invisible. Add a test case for accelerator_backend="none" and fix the helper to match the implementation.

Comment thread pyproject.toml
"pytz==2026.1.post1",
"urllib3==2.7.0",
# mlc-scripts for system info
"mlc-scripts==1.1.0",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · designmlc-scripts in core deps inflates every install and transitively pulls in requests.

"mlc-scripts==1.1.0",

This adds requests and giturlparse as transitive dependencies for every user of the benchmark tool, even those who never use system_info. AGENTS.md explicitly prohibits requests from production code. Consider moving mlc-scripts to an optional group (e.g. [project.optional-dependencies] sysinfo = ["mlc-scripts==1.1.0"]) so users opt in, which would also make the ImportError guard in capture.py live rather than dead code.

}
}
}
fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · bug — Temp file leak if yaml.dump raises after mkstemp.

fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_")
with os.fdopen(fd, "w") as fh:
    yaml.dump(data, fh, default_flow_style=False)
return path

If yaml.dump raises, the with block closes the fd but path is never returned, so node_config_tmp stays None in the caller and the finally block's os.unlink guard won't fire. Wrap with try/except to unlink on error:

fd, path = tempfile.mkstemp(suffix=".yaml", prefix="mlperf_node_cfg_")
try:
    with os.fdopen(fd, "w") as fh:
        yaml.dump(data, fh, default_flow_style=False)
except Exception:
    os.unlink(path)
    raise
return path

"log-based framework detection when endpoint_url is unavailable."
),
)
log_path: str | None = Field(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · api-contract — No cross-field validation between serving_node and log_path.

log_path: str | None = Field(default=None, ...)

A user who sets log_path but omits serving_node silently passes the path to mlcflow without a target host, likely producing confusing or incomplete output. A model_validator checking log_path is None or serving_node is not None would catch this at config-load time with a clear error message.

try:
import mlc # noqa: PLC0415
except ImportError as exc:
raise SetupError(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · error-handlingImportError handler is unreachable dead code.

try:
    import mlc  # noqa: PLC0415
except ImportError as exc:
    raise SetupError("mlc-scripts is required...") from exc  # never reached

mlc-scripts==1.1.0 is in core dependencies, so ImportError can never be raised in a correctly installed environment. This creates a misleading signal that the feature is optional-install-gated. See also: related comment on pyproject.toml:77 about moving mlc-scripts to an optional group, which would make this guard meaningful.

"inference_endpoint.commands.benchmark.execute._build_run_metadata",
return_value={},
),
patch("builtins.open", side_effect=OSError("disk full")),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · testing — Global builtins.open patch blocks all writes, including results.json.

patch("builtins.open", side_effect=OSError("disk full")),

The test only asserts no exception propagates, but because results.json is also blocked, it doesn't verify the write-ordering invariant (results.json written before run_metadata.json). The sibling test test_results_json_written_even_if_metadata_write_fails does this correctly with selective patching. Consider renaming this test to test_disk_full_does_not_abort_finalize to clarify its narrower scope.

@arekay-nv

Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — enterprise policy block) | Depth: thorough

Found 9 issues across 7 files.

🔴 Must Fix (high)

# File Line Category Summary
1 src/inference_endpoint/commands/benchmark/execute.py 1020 data-integrity run_metadata.json path passed to mlcflow unconditionally even if the file write failed

🟡 Should Fix (medium)

# File Line Category Summary
2 src/inference_endpoint/commands/benchmark/execute.py 1013 design Lazy import violates AGENTS.md rules; ImportError guard is dead code since mlc-scripts is in core deps
3 src/inference_endpoint/commands/sysinfo/cli.py 50 api-contract Docstring example uses output_path which doesn't exist on SysInfoCaptureConfig — causes ValidationError at runtime
4 tests/unit/sys_info/test_capture.py 148 testing _build_tags helper always appends accelerator_backend tag; real code skips it for "none", gap is untested
5 pyproject.toml 77 design mlc-scripts in core deps pulls in transitive requests (violates AGENTS.md) and inflates every install

🔵 Consider (low)

# File Line Category Summary
6 src/inference_endpoint/sys_info/capture.py 56 bug Temp file leak if yaml.dump raises after mkstemp (fd closed but path never unlinked)
7 src/inference_endpoint/config/schema.py 717 api-contract No cross-field validation: log_path without serving_node silently passes an incomplete config to mlcflow
8 src/inference_endpoint/sys_info/capture.py 78 error-handling ImportError handler is unreachable dead code (related to #2 and #5)
9 tests/unit/commands/test_benchmark_finalization.py 90 testing Global builtins.open patch also blocks results.json; test scope is narrower than its name implies

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants