Skip to content

feat: Enable per dataset max-osl#344

Draft
arekay-nv wants to merge 2 commits into
mainfrom
arekay/per_dataset_maxosl
Draft

feat: Enable per dataset max-osl#344
arekay-nv wants to merge 2 commits into
mainfrom
arekay/per_dataset_maxosl

Conversation

@arekay-nv

Copy link
Copy Markdown
Collaborator

What does this PR do?

Enables the max-osl to be specified per-dataset. This enables running perf/accuracy runs with different OSL in the same benchmark.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@arekay-nv arekay-nv requested a review from a team June 6, 2026 21:36
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from nvzhihanj June 6, 2026 21:36

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a model_params_override field to the dataset configuration, allowing per-dataset overrides of the top-level model parameters. This is particularly useful for scenarios like MLPerf-style runs where accuracy and performance datasets require different configurations. The changes include validation of override keys, merging logic in effective_model_params, template updates, and comprehensive unit tests. Feedback from the reviewer highlights that the current implementation of effective_model_params performs a shallow merge, which could inadvertently overwrite entire nested dictionaries (such as osl_distribution). The reviewer suggests implementing a deep merge helper to preserve other nested configuration fields and adding corresponding unit tests to verify this behavior.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread tests/unit/config/test_schema.py Outdated

@arekay-nv arekay-nv left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Claude-only review (Codex failed: cloud policy auth error)

Comment thread src/inference_endpoint/openai/completions_adapter.py Outdated
Comment thread src/inference_endpoint/openai/completions_adapter.py Outdated
@arekay-nv

Copy link
Copy Markdown
Collaborator Author

Review Council — Claude-only Code Review

Reviewed by: Claude (Codex failed — cloud workspace policy auth error) | Depth: standard (240 lines)

Found 2 issues across 2 files:

  • 1 medium
  • 1 low
# File Line Severity Category Summary
1 src/inference_endpoint/openai/completions_adapter.py 65 medium bug skip_special_tokens: False in metadata dict never wired into TextCompletionRequest — silently dropped
2 src/inference_endpoint/commands/benchmark/execute.py 320 low bug FileNotFoundError handler uses performance_cfgs[0].path instead of the perf_cfg variable extracted on line 305 (outside diff hunk — not posted inline)

Issue 1 (inline): completions_adapter.py:65"skip_special_tokens": False is added to metadata and flows into query.data, but TextCompletionRequest has no such field and encode_query() never reads it. Either wire it through to the request or remove the entry.

Issue 2 (summary-only, line outside diff): execute.py:320 — the FileNotFoundError handler body still uses performance_cfgs[0].path instead of perf_cfg.path. This PR introduced perf_cfg = performance_cfgs[0] at line 305 to avoid repeated indexing, but the error message at line 320 was not updated. Minor consistency nit.

prompt: text_input
accuracy_config: null # Accuracy evaluation settings
multi_turn: null # Multi-turn conversation configuration
model_params_override: null # Per-dataset overrides for the top-level model_params (sparse — only the fields you want to override). Merged on top of BenchmarkConfig.model_params at dataset-load time. Useful for MLPerf-style runs where accuracy and performance use different output budgets in the same fleet, e.g. model_params_override: {max_new_tokens: 32768, streaming: 'on'}.

@nvzhihanj nvzhihanj Jun 7, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name doesn't look right. Convention is we call the parameters of the model as model parameters, e.g. the one from: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/config.json. These are invariants unless we are debugging the model.

And the max OSL and streaming are generation config: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/generation_config.json (and also https://github.com/mlcommons/inference_results_v6.0/blob/main/closed/NVIDIA/code/deepseek-r1/tensorrt/generation_config.json which has more top_p, top_k). These are controlled by the client through parameters in the query

Right now it seems like we are hard-coding the generation config in the dataset (through transform) - ideally we should create generation config for each "benchmark" (because it's a per-run settings, not per-dataset). I would recommend that you add this to post-mortem and fix the hierarchy properly later.

If you want to WAR short-term, at least name it generation_configs_override and put TODO to fix them later

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. The name is coming from the source field which show up as the model parameters, which isn't quite right. Since it is a generation config, and so should be a property of the dataset. The reason i say it is a property of the dataset is that we can have a situation where we want different osl for each dataset (like we have repeats for different accuracy datasets). Tying it with the dataset will also enable us to generate dynamic values (distributions) over a dataset independenty.

Comment thread src/inference_endpoint/config/schema.py Outdated

@nvzhihanj nvzhihanj left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — first-principles WAR design review

Reviewed by: Claude (Codex review timed out — xhigh reasoning latency exceeds the 600s cap even on this small diff) · Depth: thorough

Scope: is the per-dataset max-OSL WAR correctly designed for a clean later refactor, given the long-term goal of per-phase generation config? 8 net-new findings (none duplicate the existing maintainer / gemini comments); verdict + long-term plan in the summary comment. The HIGH streaming data-integrity bug was verified against the source.

Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread tests/unit/commands/test_benchmark.py Outdated
Comment thread src/inference_endpoint/config/templates/offline_template_full.yaml Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
@nvzhihanj

Copy link
Copy Markdown
Collaborator

Review Council — first-principles WAR design review

Reviewed by: Claude · Depth: thorough
Codex review timed out (its xhigh reasoning latency exceeds the 600s cap even on this 381-line diff); Claude-led. The HIGH streaming bug was verified against the source.

This is a focused assessment of whether the per-dataset max-OSL WAR is correctly designed for a clean later refactor, given the long-term goal of per-phase generation config. 8 net-new findings posted inline — none duplicate the existing maintainer / gemini comments.

Verdict on the WAR

It achieves the immediate need (accuracy vs performance differ, since they're separate dataset entries) but is not yet "correct" as a WAR — three structural issues:

  1. Wrong axis. It's keyed to dataset identity, which cannot express the stated long-term goal ("different generation config for the same dataset across phases") — phases derive from datasets and duplicate (name,type) entries are rejected, so you'd have to duplicate the dataset entry.
  2. Over-broad surface. It accepts all ModelParams keys, including three that silently misbehave per-dataset: name (→ ISL/OSL tokenized with the wrong tokenizer), streaming (→ aggregator ignores it; TTFT/TPOT unrecorded), osl_distribution (no consumer + shallow-merge degrades). The field's own documented example advertises streaming: 'on', which doesn't work.
  3. Naming (the maintainer's point): model_params_override mislabels generation config as model parameters.

Minimal clean WAR: rename → generation_config_override; restrict accepted keys to the per-dataset-meaningful set (max_new_tokens + sampling params), rejecting name/streaming/osl_distribution; add # TODO(post-mortem): generation config is per-phase, not per-dataset; trim the template/docstring example accordingly.

Long-term: split ModelParamsModelIdentity {name, tokenizer} (per-run invariant) + GenerationConfig {max_new_tokens, temperature, top_p, top_k, penalties, streaming, osl_distribution} (per-phase). Carry GenerationConfig on PhaseConfig; adapters' dataset_transforms read the phase's GenerationConfig, and the dataset becomes generation-agnostic. "Same dataset, different gen-config per phase" then falls out naturally (per-phase streaming additionally needs per-phase aggregator handling).

🔴 Must-fix (for the WAR to be correct)

# File Line Cat Summary
A config/schema.py 317 data-integrity Per-dataset streaming override reaches request rows but the single aggregator reads the global streaming (execute.py:185) → TTFT/TPOT silently unrecorded. The documented example advertises this broken key.
B config/schema.py 309 design Override bound to dataset identity is a dead-end for "same dataset, different gen-config per phase." WAR: rename + TODO. Long-term: per-phase GenerationConfig.
C config/schema.py 329 design Accepts all ModelParams keys incl. name → per-dataset name mismatches the global tokenizer/aggregator. Restrict to a per-dataset allowlist.

🟡 Should-fix

# File Line Cat Summary
D config/schema.py 353 bug (extends gemini) Shallow-merge of osl_distribution silently resets siblings and passes validation (all defaults), so the docstring's "catches bad overrides" is false; also osl_distribution has no consumer → overriding it is a no-op.
E tests/unit/commands/test_benchmark.py 1144 testing Gaps: nested/deep-merge, streaming end-to-end, multi-turn loader, and the completions adapter max_tokens key (tests only assert the chat max_completion_tokens, but the PR targets the completions path).
F config/templates/offline_template_full.yaml (×3) 29 design WAR baked into all three generated templates + the template-parse test → load-bearing / hard to undo; also advertises broken streaming:'on'. Rename now; trim the example.
G config/schema.py 344 error-handling Override value validation deferred to load-time (after setup writes report dir / config.yaml / probes tokenizer); move into BenchmarkConfig._resolve_and_validate to fail atomically at construction.

🔵 Consider

# File Line Cat Summary
H commands/benchmark/execute.py 293 error-handling Narrow except Exception(ValidationError, ValueError) so a future bug isn't mislabeled "invalid model_params_override."

Through-line: A, C, D, F are all symptoms of the same root cause — generation config is modeled as part of ModelParams and attached to the dataset, so the override surface is both too wide (accepts identity/per-run keys) and on the wrong axis. Renaming + key-allowlisting makes the WAR honest now; the ModelIdentity / GenerationConfig split on PhaseConfig is the long-term fix.

Dedup: builds on — does not duplicate — the maintainer's naming/per-run comment and gemini's shallow-merge note (D explicitly extends it). The skip_special_tokens dead-code was already flagged by a prior review and is omitted here.

@arekay-nv arekay-nv requested a review from nvzhihanj June 8, 2026 23:16
arekay-nv added 2 commits June 8, 2026 18:20
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv force-pushed the arekay/per_dataset_maxosl branch from fa1ba98 to 553a6d2 Compare June 8, 2026 23:35
multi_turn: MultiTurnConfig | None = Field(
None, description="Multi-turn conversation configuration"
)
# TODO(post-mortem): generation config is per-phase (perf vs. accuracy),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nvzhihanj nvzhihanj marked this pull request as draft June 15, 2026 03:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants