Skip to content

docs: compliance audit module redesign plan#332

Open
wu6u3tw wants to merge 8 commits into
mlcommons:mainfrom
wu6u3tw:feat/test04-compliance
Open

docs: compliance audit module redesign plan#332
wu6u3tw wants to merge 8 commits into
mlcommons:mainfrom
wu6u3tw:feat/test04-compliance

Conversation

@wu6u3tw

@wu6u3tw wu6u3tw commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

This PR is a design proposal (docs/compliance_audit_plan.md), not an implementation. The prior TEST04 implementation was reset after the design review on this PR; this redesign folds in that feedback plus the maintainer's workflow requirements before any code is written.

The plan describes a modular, extensible compliance/audit framework with TEST04 as the first concrete test (targeting WAN2.2-T2V) and TEST01/06/07/09 as documented extensions.

What the design specifies

  • AuditTest protocol (plan_runs + verify), resolved from a registry. A generic orchestrator runs the planned phases back-to-back and never names a specific test. Adding a test = one file + one enum value + one import line.
  • One command, both phases, same query count. An audit: block runs the reference and audit phases back-to-back at a single shared sample count — addressing the fairness concern (no more ref=50 / audit=20).
  • No leakage into general-purpose code. No test04 boolean, no DatasetType.AUDIT; run modification is a generic typed SampleOrderSpec, and audit params live in a structured AuditConfig sub-model (mirroring AccuracyConfig).
  • Typed RunSpec replaces the stringly-typed override kwarg; all specs validated against the loaded dataset size, and paced loads rejected, before any run executes.
  • One verifier core + thin adapters; atomic verdict write; robust from_dir re-check path.
  • Unified submission. audit: is additive — a single type: submission YAML runs perf + accuracy + audit under one report_dir.
  • WAN2.2-T2V first target. Offline + SingleStream example configs, with the MLCommons audit.config knobs (performance_issue_same_index=3, tolerance, min_query_count) mapped to AuditConfig.

Diagrams & traceability

  • §4 includes an ASCII component map and a TEST04 two-phase program-flow diagram.
  • §8 maps every comment thread on this PR — both Review Council passes, the maintainer/encapsulation threads, and the Gemini robustness comments — to where the design resolves it.

Status

Seeking design buy-in before implementation. Once acknowledged, the implementation lands following §6 (file-by-file) and §9 (success criteria).

🤖 Generated with Claude Code

@wu6u3tw wu6u3tw requested a review from a team June 3, 2026 20:53
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +55 to +60
try:
result = verify_test04_dirs(reference_dir, test04_dir, threshold=threshold)
except (FileNotFoundError, ValueError) as e:
print(f"TEST04 ERROR: {e}", file=sys.stderr)
sys.exit(2)
verdict_path = write_verdict(result, output_dir or test04_dir)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test04 CLI command only catches FileNotFoundError and ValueError when calling verify_test04_dirs. However, file reading inside verify_test04_dirs can raise other OSError subclasses like PermissionError. Furthermore, write_verdict is called outside the try-except block and can also raise OSError (e.g., if the directory is read-only or disk is full). Wrapping both calls in a try-except block that catches OSError ensures all setup/IO errors are handled gracefully and exit with code 2 as documented.

Suggested change
try:
result = verify_test04_dirs(reference_dir, test04_dir, threshold=threshold)
except (FileNotFoundError, ValueError) as e:
print(f"TEST04 ERROR: {e}", file=sys.stderr)
sys.exit(2)
verdict_path = write_verdict(result, output_dir or test04_dir)
try:
result = verify_test04_dirs(reference_dir, test04_dir, threshold=threshold)
verdict_path = write_verdict(result, output_dir or test04_dir)
except (OSError, ValueError) as e:
print(f"TEST04 ERROR: {e}", file=sys.stderr)
sys.exit(2)

Comment thread src/inference_endpoint/compliance/test04.py Outdated
Comment on lines +156 to +157
report = Report.from_snapshot(json.loads(snapshot_path.read_bytes()))
qps = report.qps()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _run_stats_from_dir, Report.from_snapshot is called directly on the loaded JSON. If the snapshot JSON is malformed or has an unexpected structure, Report.from_snapshot can raise KeyError or TypeError. These exceptions are not caught by verify_test04_dirs or the CLI's test04 command, leading to an unhandled crash with a traceback instead of a clean exit with code 2. Wrapping the call in a try-except block and raising a ValueError ensures these errors are handled gracefully.

    try:
        report = Report.from_snapshot(json.loads(snapshot_path.read_bytes()))
    except (ValueError, KeyError, TypeError) as e:
        raise ValueError(f"Failed to parse snapshot in {run_dir}: {e}") from e
    qps = report.qps()

wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 3, 2026
…review

Address gemini-code-assist review on PR mlcommons#332:
- CLI catches OSError (PermissionError etc.) and write_verdict failures,
  not just FileNotFoundError/ValueError — all map to exit 2.
- _audit_marker tolerates non-dict results.json (isinstance guards) instead
  of raising AttributeError.
- _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wu6u3tw wu6u3tw requested review from arekay-nv and nv-alicheng June 3, 2026 22:19
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment on lines +14 to +25
# # 1. Reference performance run
# inference-endpoint benchmark from-config \
# examples/09_Wan22_VideoGen_Example/offline_wan22.yaml
#
# # 2. This TEST04 run (audit: test04 issues one fixed prompt)
# inference-endpoint benchmark from-config \
# examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml
#
# # 3. Compare throughput (videos/s) and write the verdict
# python -m inference_endpoint.compliance test04 \
# logs/wan22_video_generation_benchmark \
# logs/wan22_video_generation_test04

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem the best flow to run audit tests. In Test04, only partial of the original queries will be sent (25/50?) and same in repeated tests, and you should be running 1 command instead of 2 to avoid underlying endpoints to change (back to back). You would be forced to run the whole 248 queries if you design this way

My thinking is that you should define either the loadgen (with some audit pattern) or datasets (with type: audit) to run this behavior, and inference endpoints will internally run 2 stages and compare the perf. Right now the workflow is segregated and seems clunky to run.

@arekay-nv @nv-alicheng to share your opinions as well

Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated
@wu6u3tw

wu6u3tw commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

Update summary

All review feedback has been addressed. Here is what changed since the original submission:

Architecture (main concern)

  • audit: test04 now runs both phases in a single command — reference run then audit run back-to-back against the same endpoint, with automatic comparison and verdict output. No more 3-step manual workflow.

Config shape

  • type: audit dataset replaces the old settings.runtime.test04_sample_index and audit_n_samples runtime variables. Reference and audit sample counts are now independent and co-located with the dataset config — consistent with how type: accuracy datasets carry their own accuracy_config.
audit: "test04"

datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50          # reference phase query count (50–144)

  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25          # audit phase query count (25–50)
    audit_sample_index: 0

Robustness

  • Warning logged when audit: test04 is set but no type: audit dataset is present (previously silent fallback to index 0).
  • Phase failures (SetupError/ExecutionError) are caught and logged cleanly — no unhandled traceback, verdict not lost.
  • Report.from_snapshot wrapped in try/except in _run_stats_from_dir — malformed snapshots exit with code 2 instead of crashing.
  • Pre-flight audit_sample_index bounds check before dataset load.

Testing

  • New e2e integration test (test_audit_test04_two_phase_flow) exercises the full run_benchmark → two-phase flow against the echo server and asserts both phase subdirs are created and the flow completes gracefully.

Example config

  • Renamed offline_wan22_test04.yamlwan22_audit_test04.yaml per review suggestion.

@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 9057190 to b547f1d Compare June 4, 2026 23:14
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from cdbae64 to eae1234 Compare June 4, 2026 23:40
- name: wan22_prompts
path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl
type: "audit"
ref_samples: 50 # reference phase query count

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we choose the sample ref_samples and audit_samples - is it intended to be 50 and 25?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified.

@nvzhihanj nvzhihanj Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I am still a bit confused - we are comparing the QPS of 50 dataset samples with 20 repeated samples? That doesn't seem fair - can you remind me when the decision was made?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In wan22, ref samples is current set as 248 and audit samples is 20.

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
@@ -0,0 +1,329 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an odd file name. test_audit_test04 probably

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed.

Comment thread README.md Outdated
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 3 times, most recently from b190d21 to e0de06f Compare June 5, 2026 21:03

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change max_duration_ms to a reasonable number.

@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from 385630c to c1e48bf Compare June 5, 2026 21:22
@wu6u3tw

wu6u3tw commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

All review feedback has been addressed. Here's a summary of what changed:

Architecture
audit: test04 now runs reference and audit phases in a single command back-to-back against the same endpoint — no more 3-step workflow, no endpoint-change risk. A single type: audit dataset entry drives both phases (carrying ref_samples, audit_samples, audit_sample_index).

Sample counts & index
ref_samples: 50, audit_samples: 25 — sized for WAN2.2 throughput. audit_sample_index: 3 — fixed per MLCommons audit.config (performance_issue_same_index=3 for WAN2.2).

SingleStream
Added wan22_single_stream_test04.yaml (concurrency=1, ref/audit samples=20 matching MLCommons min_query_count).

Durations
Perf configs: min=10min, max=4hr. Audit configs: min=10min, max=2hr. The 10-min minimum documents MLCommons compliance intent; counts take priority in the current session stop logic, with AND-semantics available as a future improvement.

Robustness fixes (Gemini)

  • write_verdict moved inside try-except in CLI
  • _audit_marker uses isinstance guards — no AttributeError possible
  • Report.from_snapshot wrapped in try/except (KeyError, TypeError) in _run_stats_from_dir

Cleanup

  • Test renamed to test_audit_test04.py
  • README.md removed from diff (rebased onto main)
  • Orphaned type: audit datasets in non-TEST04 configs now emit a warning; multiple audit datasets raise InputValidationError

@@ -0,0 +1,60 @@
# Offline TEST04 Compliance Run for WAN 2.2 (GB200/GB300)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of the file is not consistent. some have offline/singlestream, some have audit_test04. Can you fix it?

@nvzhihanj nvzhihanj left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — first-principles design review

Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough

Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.

other_datasets = [d for d in config.datasets if d.type != DatasetType.AUDIT]

ref_perf_ds = audit_ds.model_copy(
update={"type": DatasetType.PERFORMANCE, "samples": audit_ds.ref_samples}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high · bug — ref_samples is a dead write.

ref_perf_ds = audit_ds.model_copy(update={..., "samples": audit_ds.ref_samples}) sets the Dataset.samples field (schema.py:308), but that field is consumed nowhere in src/: DataLoaderFactory.create_loader reads path/format/parser/name only, and there is no .samples read anywhere. Meanwhile ref_config (the lines just below) never sets n_samples_to_issue. So with the shipped example (min_duration_ms: 600000) the reference phase runs duration-driven and silently ignores ref_samples, while the audit phase does honor audit_samples (it sets n_samples_to_issue=audit_n). The two phases the verdict compares therefore run mismatched counts, and the PR's "ref_samples = reference phase query count" is false. Fix: set n_samples_to_issue=ref_samples on ref_config.settings.runtime exactly as the audit phase does (and drop the no-op samples= updates).


"""MLPerf compliance checks for endpoint benchmarking."""

from .test04 import (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high · design — No AuditTest abstraction; TEST04 is hardcoded end-to-end.

The compliance package exports only test04-named symbols, AuditMode has a single value, run_benchmark hardcodes if config.audit == AuditMode.TEST04: _run_benchmark_test04(...) (execute.py:1228), and the orchestrator is bespoke. Why re-design: the stated goal is a modular framework, but adding TEST01/TEST05 here means a cross-cutting edit of the enum + schema + load generator + verifier + orchestration for every new audit. Suggested shape: an AuditTest protocol in compliance/plan_runs(config) -> list[RunSpec] and verify(reports) -> Verdict — registered by AuditMode; run_benchmark dispatches audit.plan_runs(...) generically and never names a specific test.

Comment thread src/inference_endpoint/config/schema.py Outdated

PERFORMANCE = "performance"
ACCURACY = "accuracy"
AUDIT = "audit"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high · design — DatasetType.AUDIT is an abstraction leak.

It is a dataset type the normal loader deliberately ignores-with-warning (_load_datasets, execute.py:256), exists only to carry TEST04 scalars (ref_samples/audit_samples/audit_sample_index, schema.py:321-335) on the shared Dataset model used by perf/accuracy datasets too, and is immediately model_copy(update={"type": PERFORMANCE})-converted in _run_benchmark_test04. Why re-design: a fake dataset type the loader can't load, smuggling test-specific params onto the shared schema, forces every future audit to either piggyback more Dataset fields or invent another fake type. Suggested shape: drop DatasetType.AUDIT; move the audit params into a structured audit: sub-config block, and let the audit's plan_runs derive each phase from a normal PERFORMANCE dataset + that block.


# kw_only so trailing defaults here don't force defaults on subclass fields
# (e.g. _RuntimeSettings.model in rulesets/mlcommons/rules.py).
test04: bool = field(default=False, kw_only=True)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high · design — TEST04 leaks into the general-purpose load generator.

test04: bool and test04_sample_index: int are TEST04-specific fields on the frozen, general-purpose RuntimeSettings, and create_sample_order dispatches on settings.test04 (sample_order.py:128). Why re-design: each future audit would add its own boolean + factory branch into the hot-path config object and the ordering factory — TEST04 knowledge is now wired into core load-gen. Suggested shape: replace the per-test boolean with a generic ordering selector (e.g. a SampleOrderSpec strategy descriptor: WITHOUT_REPLACEMENT | SINGLE(index=...)); create_sample_order switches on that, and the audit populates it via its run plan. The load generator then knows nothing about "test04".

# Defined as a constant so the sender (_run_benchmark_test04) and the
# receiver (setup_benchmark) stay in sync without relying on matching
# string literals.
_OVERRIDE_TEST04_SAMPLE_INDEX = "test04_sample_index"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · design — _OVERRIDE_TEST04_SAMPLE_INDEX is a stringly-typed magic kwarg.

It is threaded through **runtime_overrides from _run_benchmark_test04 (line ~1201) into setup_benchmark, which reaches into the opaque dict via runtime_overrides.get(_OVERRIDE_TEST04_SAMPLE_INDEX, 0) (~line 492) and forwards the rest into RuntimeSettings.from_config. Why re-design: hidden coupling — caller and callee must agree on a dict key, the value bypasses the typed config path, and the signature gives no hint an audit index can arrive. Suggested shape: pass a typed run_spec: AuditRunSpec | None so the sample index / ordering strategy are explicit, type-checked fields.

assert (tmp_path / "test04").is_dir()
assert "TEST04 audit: phase 1/2" in caplog.text
assert "TEST04 audit: phase 2/2" in caplog.text
verdict_written = (tmp_path / "test04_verdict.json").exists()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · testing — The TEST04 orchestrator is effectively untested.

_run_benchmark_test04 (the model_copy surgery, ref_samples/audit_samples wiring, the pre-flight bounds loop, and the three phase-failure early-returns) has no unit test. The only coverage is this integration test, whose terminal assertion is lenient — assert verdict_written or comparison_errored — and it uses min_duration_ms=0, the one regime where the ref_samples dead-write doesn't bite. So it cannot catch the ignored-ref_samples bug, a swapped count, or a wrong verdict. Add unit tests asserting the reference phase issues ref_samples and the audit phase issues audit_samples, plus that each early-return path is hit.

Comment thread src/inference_endpoint/config/schema.py Outdated
Literal[TestType.OFFLINE, TestType.ONLINE] | None,
cyclopts.Parameter(show=False),
] = None
audit: Annotated[

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · design — audit is a bare top-level enum with params scattered.

audit: AuditMode | None sits on BenchmarkConfig while its parameters live on the Dataset model (ref_samples/audit_samples/audit_sample_index) and its threshold is hardcoded in test04.py (DEFAULT_THRESHOLD = 0.10). Why re-design: as audits accumulate, a flat enum + dataset-borne params has nowhere to put per-test knobs and the YAML loader can't validate that the right params accompany the chosen audit. Suggested shape: a structured audit:/compliance: sub-config discriminated on the audit name, co-locating the selection + its parameters (threshold, counts, repeat index) — mirroring accuracy_config.

)


def verify_test04_from_reports(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · design — QPS is compared across phases that issue different counts/contents.

The reference issues ref_samples distinct samples; the audit issues audit_samples of one repeated sample (example 50 vs 20). Upstream MLCommons TEST04 keeps the query set/count identical between the two runs. Comparing Report.qps() across runs that differ in count, content, and duration is not apples-to-apples — a short audit can post a higher instantaneous QPS for reasons unrelated to caching (ramp-up amortization, fewer tail effects). The completion-count guard (test04.py:128) only rejects the audit completing far fewer than the reference; it does nothing about the count/duration mismatch in the PASS direction. (Extends the maintainer's existing fairness thread; compounded by the ref_samples dead-write, which makes the reference duration-driven.) Suggested shape: drive both phases from one count (default audit_samples == ref_samples) and have the verdict assert that precondition.

return qps, report.n_samples_completed


def verify_test04_dirs(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · design — Two parallel verifier entry points.

verify_test04_dirs (disk, used by the CLI) and verify_test04_from_reports (in-memory, used by the integrated flow) both end in verify_test04, but only the dirs path carries the swapped-directory audit-marker guard (lines 185-194). Why re-design: a future audit must reimplement both surfaces and behavior can drift (the guard already exists in one but not the other). Suggested shape: one verify(ref_stats, audit_stats) core taking (qps, completed) tuples, with thin Path/Report adapters; the directory-order guard becomes an explicit precondition on the adapter.

f"audit test04 requires max_throughput (offline) or concurrency (single-stream); "
f"a paced load ({load_pattern_type.value}) caps throughput and hides caching."
)
if not 0 <= test04_sample_index < n_samples:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low · bug — audit_sample_index is bound-checked against requested counts, not the loaded dataset size, until phase 2.

_validate_audit_test04 checks the index against the loaded dataset size, but only runs for the audit phase (the ref phase has audit=None and returns early). The schema field (ge=0 only) and the pre-flight in _run_benchmark_test04 check against ref_samples/audit_samples (requested counts), never the actual loaded row count. So audit_sample_index valid vs audit_samples but >= the real dataset size only fails inside the audit phase's SingleSampleOrder construction — after a full reference run has already executed and wasted time. Consider validating against the loaded dataset size up front.

@nvzhihanj

Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review (first-principles design review)

Reviewed by: Claude · Depth: thorough
Codex review timed out on this 2046-line diff at xhigh reasoning (the load-gen + compliance surface is large); this pass is Claude-led. The one HIGH bug below was independently verified against the source.

Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline.

🔴 Re-design / Must-fix

# File Line Cat Why it needs a re-design
1 commands/benchmark/execute.py 1151 bug ref_samples is a dead write. Dataset.samples is consumed nowhere; ref_config never sets n_samples_to_issue, so the reference phase runs duration-driven and ignores ref_samples while the audit phase honors audit_samples → the compared phases run mismatched counts. Set n_samples_to_issue=ref_samples.
2 compliance/__init__.py 18 design No AuditTest abstraction. run_benchmark hardcodes if audit==TEST04; package exports only test04_*. Adding TEST01/05 = cross-cutting edits everywhere. Introduce an AuditTest protocol (plan_runs+verify) registered by AuditMode.
3 config/schema.py 82 design DatasetType.AUDIT is a fake dataset type the loader ignores, carrying test params on the shared Dataset model, then converted to PERFORMANCE. Move params to a structured audit: block; drop the fake type.
4 config/runtime_settings.py 90 design test04 boolean leaks into core load-gen. RuntimeSettings.test04/test04_sample_index + create_sample_order's if settings.test04. Use a generic sample-order strategy selector, not a per-test flag.

🟡 Should-fix

# File Line Cat Summary
5 commands/benchmark/execute.py 113 design _OVERRIDE_TEST04_SAMPLE_INDEX stringly-typed magic kwarg through **runtime_overrides; pass a typed run_spec instead.
6 commands/benchmark/execute.py 1146 design Two-phase model_copy surgery is fragile (root cause of #1; ref phase also skips _validate_audit_test04). Use a declarative RunSpec + validate before any phase runs.
7 tests/integration/commands/test_benchmark_command.py 209 testing _run_benchmark_test04 has no unit test; the one integration test asserts verdict OR error with min_duration_ms=0 — the regime that hides bug #1.
8 config/schema.py 666 design audit bare top-level enum; params scattered, threshold hardcoded. Use a structured compliance sub-config (like accuracy_config).
9 compliance/test04.py 206 design QPS compared across phases with different counts/contents (upstream TEST04 uses the same query set); completion guard only protects the FAIL direction. Extends the existing fairness thread; compounded by #1.

🔵 Consider

# File Line Cat Summary
10 compliance/test04.py 175 design verify_test04_dirs vs verify_test04_from_reports duplication; dir-swap guard in one path only. Collapse to one core + thin adapters.
11 commands/benchmark/execute.py 446 bug audit_sample_index bound-checked vs requested counts, not the loaded dataset size, until phase 2 — an out-of-range index wastes a full reference run.

Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto run_benchmark via per-phase config surgery and untyped overrides instead of a first-class audit-test abstraction (#2). Fixing #2/#3/#4 (an AuditTest that emits typed RunSpecs + a generic ordering strategy) would dissolve most of the others structurally.

Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.

target_concurrency: 1 # SingleStream: one request in-flight at a time

client:
num_workers: 1

@viraatc viraatc Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any specific reason to set num_workers?
can we use default instead?

(also in the other config yamls in this MR)

Comment thread src/inference_endpoint/config/schema.py Outdated
multi_turn: MultiTurnConfig | None = Field(
None, description="Multi-turn conversation configuration"
)
ref_samples: int | None = Field(

@viraatc viraatc Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems specific to test04, lets rename these flags to have the prefix test04_.

EDIT: its better to encapsulate under AuditConfig as per the comment below.

Comment thread src/inference_endpoint/config/schema.py Outdated
Literal[TestType.OFFLINE, TestType.ONLINE] | None,
cyclopts.Parameter(show=False),
] = None
audit: Annotated[

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 , audit should be a config class not simple true/false.
it should encapsulate all test-04 flags as well instead of placing these in Dataset.

ref_samples: int | None = Field
audit_samples: int | None = Field
audit_sample_index: int = Field

I suggest updating audit-mode -> audit-config class which is single place of definition of all compliance flags and defaults.

return bench


def _run_benchmark_test04(config: BenchmarkConfig, test_mode: TestMode) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to move test04 specific functionality and constants to compliance/test04.py?


def create_sample_order(settings: RuntimeSettings) -> SampleOrder:
"""Create a SampleOrder from RuntimeSettings."""
if settings.test04:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: where is test04 defined on top of runtime settings?
not able to find its definition?

lets try to have encapsulate this smth like settings.audit_config.test == Constants.TEST04 for maintainability.

@viraatc viraatc left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Claude review (Codex unavailable). Focused on design integration quality per author request.

if audit_n is not None
else config.settings
)
audit_config = config.model_copy(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Review Council] critical · bugaudit_config retains audit=AuditMode.TEST04, creating a re-entrancy trap.

audit_config = config.model_copy(
    update={
        "report_dir": str(report_dir / "test04"),
        # audit is NOT nulled out here — cf. ref_config which sets "audit": None
    }
)

ref_config correctly sets "audit": None. audit_config does not. If run_benchmark(audit_config, …) is ever called instead of the internal _run_benchmark_single(audit_config, …), it routes back into _run_benchmark_test04() and loops indefinitely — the two-phase orchestrator calls itself on the audit sub-config. The distinction is currently enforced only by the call-site convention, not the type system. Add "audit": None to audit_config's update dict, matching ref_config.

return bench


def _run_benchmark_test04(config: BenchmarkConfig, test_mode: TestMode) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Review Council] high · correctness_run_benchmark_test04() always returns None: PASS, FAIL, and setup error are indistinguishable to the caller.

The standalone CLI (compliance/__main__.py) correctly uses sys.exit(0 if result.passed else 1), so automation can tell pass from fail. The integrated path silently logs and returns — there is no exit code, no raised exception, and no return value to indicate the verdict. Any script wrapping inference-endpoint benchmark … --audit test04 must parse the verdict JSON on disk because the process exit code is always 0 on FAIL.

The function should return Test04Result | None and run_benchmark() should propagate a non-zero exit code (or raise) on FAIL, matching the existing error-propagation contract.

"""Write the verdict as test04_verdict.json into output_dir; return its path."""
output_dir.mkdir(parents=True, exist_ok=True)
verdict_path = output_dir / "test04_verdict.json"
verdict_path.write_text(json.dumps(result.to_dict(), indent=2) + "\n")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Review Council] high · data-integritywrite_verdict() writes the verdict file non-atomically.

verdict_path.write_text(json.dumps(result.to_dict(), indent=2) + "\n")

Every other final-output write in this codebase (final_snapshot.json) uses tmp + fsync(file) + rename + fsync(parent_dir) to guarantee the file is either complete or absent. write_text is not crash-safe: a kill midway leaves a truncated JSON that downstream MLCommons validators read as corrupt. Should use the same atomic-write pattern used in MetricsPublisher.publish_final().

Results land in ``<report_dir>/reference/`` and ``<report_dir>/test04/``;
the verdict JSON is written to ``<report_dir>/test04_verdict.json``.
"""
report_dir = (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Review Council] medium · design_run_benchmark_test04() duplicates setup_benchmark()'s report-dir creation and config.yaml write.

report_dir = Path(config.report_dir) if config.report_dir else _default_report_path()
report_dir.mkdir(parents=True, exist_ok=True)
config.to_yaml_file(report_dir / "config.yaml")

setup_benchmark() (called inside each _run_benchmark_single) already creates report_dir and writes config.yaml for the per-phase subdir. This outer block additionally creates the parent dir and writes a parent-level config.yaml. The result: config.yaml is written twice per run. More importantly, the parent-level path derivation is a copy of setup_benchmark()'s logic — future changes to one won't automatically propagate to the other. The parent-dir creation could be extracted into a shared helper, or the orchestrator could accept the resolved report_dir as a parameter rather than recomputing it.

If the runs carry an ``audit`` marker (results.json), guard against a
swapped directory order: the test04 dir must be the audit run and the
reference dir must not be."""
if _audit_marker(test04_dir) not in (None, _AUDIT_TEST04):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Review Council] low · error-handling_audit_marker(test04_dir) is called twice, reading and parsing the same file twice in the error path.

if _audit_marker(test04_dir) not in (None, _AUDIT_TEST04):
    raise ValueError(
        f"… (audit={_audit_marker(test04_dir)!r}); …"  # second read
    )

Capture once: marker = _audit_marker(test04_dir), then use marker in both the condition and the f-string.


"""MLPerf compliance checks for endpoint benchmarking."""

from .test04 import (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Review Council] low · api-contractverify_test04_from_reports and write_verdict are missing from __all__.

Both are part of the public API of this module (called from execute.py via the internal from ..compliance.test04 import … path). Any external integrator importing from inference_endpoint.compliance would expect them to be available. Add both to __init__.py's __all__.

Comment thread pyproject.toml
"hdrhistogram==0.10.3",
# Color support for cross-platform terminals
"colorama==0.4.6",
# Fix pytz-2024 import warning

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Review Council] low · scope — Unrelated dependency changes bundled into this PR.

pip==26.1.2 pin and aiohttp 3.13.5 → 3.14.0 (+ corresponding uv.lock churn) have nothing to do with TEST04 compliance. Mixing dependency upgrades with a feature PR makes git bisect harder and conflates unrelated risk surfaces. Please split these into a dedicated dependency-update PR.

@viraatc

viraatc commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — bwrap restriction) | Depth: thorough | Focus: design integration quality

Found 11 issues7 posted inline (4 already covered by existing comments).


🔴 Must Fix

# File Line Category Finding
1 commands/benchmark/execute.py 1176 bug audit_config keeps audit=AuditMode.TEST04run_benchmark(audit_config) would recurse into _run_benchmark_test04() indefinitely. ref_config correctly nulls it; audit_config must too.
2 commands/benchmark/execute.py 1107 correctness _run_benchmark_test04() always returns None — PASS, FAIL, and setup error are indistinguishable. The standalone CLI exits 0/1 correctly; the integrated path never signals the verdict. Should return Test04Result | None.
3 compliance/test04.py 242 data-integrity write_verdict() uses write_text() — not atomic. Every other final-output write in this codebase uses tmp → fsync → rename → fsync(parent). A mid-write crash leaves a corrupt JSON for MLCommons validators.

🟡 Should Fix — the "ad-hoc integration" issues

These four findings describe where TEST04-specific concepts leaked into the wrong abstraction layer. Four are already noted by existing reviewers (skipped inline):

# File Line Covered? Finding
4 config/schema.py 82 ✅ existing DatasetType.AUDIT is not a dataset purpose — it's a run-mode signal. An audit dataset is a performance dataset. The AUDIT value forces every DatasetType consumer to handle a semantically different third member. Clean fix: remove AUDIT, put the audit configuration in a dedicated AuditConfig sub-model on BenchmarkConfig (same pattern as AccuracyConfig).
5 config/schema.py 321 ✅ existing ref_samples, audit_samples, audit_sample_index on Dataset pollute the universal schema. Every dataset now silently carries three TEST04-only fields. These belong in AuditConfig on BenchmarkConfig, not on Dataset.
6 config/runtime_settings.py 90 ✅ existing test04: bool + test04_sample_index: int in RuntimeSettings bake a compliance test name into the core execution struct. Generic replacement: audit_sample_index: int | None = None — sentinel-based, works for any future fixed-sample compliance mode. create_sample_order() then checks is not None instead of .test04.
7 commands/benchmark/execute.py 113 ✅ existing _OVERRIDE_TEST04_SAMPLE_INDEX string constant is a **kwargs side-channel because audit_sample_index lives on the wrong object (Dataset instead of a top-level AuditConfig). If findings 4–5 are adopted, RuntimeSettings.from_config(config, …) can read config.audit_config.sample_index directly — the constant and side-channel disappear.
8 commands/benchmark/execute.py 1115 🆕 _run_benchmark_test04() recomputes report_dir and writes config.yaml — duplicating setup_benchmark()'s logic. A future change to setup_benchmark()'s dir handling won't propagate here automatically.

🔵 Consider (low)

# File Line Finding
9 compliance/test04.py 185 _audit_marker(test04_dir) called twice — reads+parses the same JSON twice in the error path. Capture once: marker = _audit_marker(test04_dir).
10 compliance/__init__.py 18 verify_test04_from_reports and write_verdict missing from __all__. Both are public-API entry points.
11 pyproject.toml 73 pip==26.1.2 pin + aiohttp 3.13.5→3.14.0 are unrelated to TEST04. Separate PR.

Design summary: The compliance/test04.py verifier logic itself is clean and well-structured. The integration pain is concentrated in four places: DatasetType.AUDIT, the three fields on Dataset, the two fields on RuntimeSettings, and the _OVERRIDE_TEST04_SAMPLE_INDEX side-channel. All four disappear if an AuditConfig sub-model is introduced on BenchmarkConfig following the existing AccuracyConfig pattern.

wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 10, 2026
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review.
Replaces the bolted-on TEST04 with a first-class, extensible AuditTest
abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a
single shared sample count (fair comparison), and verify() produces a typed
verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow
requirement to where the design resolves it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 5540585 to 9a0bb41 Compare June 10, 2026 18:25
@wu6u3tw wu6u3tw changed the title feat(compliance): add MLPerf TEST04 caching audit mode docs: compliance audit module redesign plan Jun 10, 2026
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 10, 2026
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review.
Replaces the bolted-on TEST04 with a first-class, extensible AuditTest
abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a
single shared sample count (fair comparison), and verify() produces a typed
verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow
requirement to where the design resolves it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 9a0bb41 to 4290b36 Compare June 10, 2026 21:33
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 10, 2026
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review.
Replaces the bolted-on TEST04 with a first-class, extensible AuditTest
abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a
single shared sample count (fair comparison), and verify() produces a typed
verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow
requirement to where the design resolves it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 4290b36 to d34459b Compare June 10, 2026 21:43
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 10, 2026
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review.
Replaces the bolted-on TEST04 with a first-class, extensible AuditTest
abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a
single shared sample count (fair comparison), and verify() produces a typed
verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow
requirement to where the design resolves it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from d34459b to 642ec6c Compare June 10, 2026 22:20
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review.
Replaces the bolted-on TEST04 with a first-class, extensible AuditTest
abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a
single shared sample count (fair comparison), and verify() produces a typed
verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow
requirement to where the design resolves it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 642ec6c to b40b0ef Compare June 10, 2026 22:46
wu6u3tw and others added 7 commits June 10, 2026 15:57
submission_wan22_offline.yaml drives all three from one config: the perf
run, VBench accuracy scoring, and the additive TEST04 audit. The perf +
accuracy portion mirrors offline_wan22_accuracy.yaml; the audit: block
reflects the proposed schema in docs/compliance_audit_plan.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lure handling

Verification against main found three doc issues:
- min_duration_ms is NOT a duration floor — the session stop check honors
  sample count and max_duration only, so the '10-min compliance minimum'
  claim was false. Documented as a current limitation (AND-semantics is
  future work) and dropped the misleading example comments.
- Specified verdict -> exit-code propagation (run_benchmark returns None
  today; the audit path must return the AuditVerdict for cli.py to sys.exit).
- Specified that a failed phase aborts without verifying (exit 2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-litigated the comparison basis during plan review: keep equal-counts
(both phases issue one shared sample count) over duration-normalized, and
record it as an explicit decision with rationale (qps is rate-normalized so
count-equality is a simplification not a correctness requirement; duration
floor not yet supported), chosen defaults (offline 64, single-stream 20),
and the upgrade path. Align generic examples to samples=64.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The design doc referred to _run_audit(config, test_mode); the implementation
plan uses run_audit(config) (audit is performance-only, so test_mode is
irrelevant, and the function is imported cross-module so a public name fits).
Standardize the design doc on run_audit(config).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t config union

Reverses the equal-count basis per design review: AuditConfig becomes a
per-test discriminated union (Test04Config) carrying samples (reference) +
audit_samples (audit, defaults to samples); verify drops the count-match guard
in favor of a per-phase completion guard + qps rate-normalization. Flags the
reviewer-conflict (this is the unequal-count pattern the maintainer questioned).
Adds submission_wan22_singlestream.yaml; sweeps the doc for stale equal-count
statements.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
submission_wan22_offline.yaml -> offline_wan22_submission.yaml
submission_wan22_singlestream.yaml -> single_stream_wan22_submission.yaml
Matches the existing offline_wan22.yaml / single_stream_wan22.yaml naming.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants