feat: Enable per dataset max-osl by arekay-nv · Pull Request #344 · mlcommons/endpoints

arekay-nv · 2026-06-06T21:36:49Z

What does this PR do?

Enables the max-osl to be specified per-dataset. This enables running perf/accuracy runs with different OSL in the same benchmark.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-06-06T21:36:58Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces a model_params_override field to the dataset configuration, allowing per-dataset overrides of the top-level model parameters. This is particularly useful for scenarios like MLPerf-style runs where accuracy and performance datasets require different configurations. The changes include validation of override keys, merging logic in effective_model_params, template updates, and comprehensive unit tests. Feedback from the reviewer highlights that the current implementation of effective_model_params performs a shallow merge, which could inadvertently overwrite entire nested dictionaries (such as osl_distribution). The reviewer suggests implementing a deep merge helper to preserve other nested configuration fields and adding corresponding unit tests to verify this behavior.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

arekay-nv

Review Council — Claude-only review (Codex failed: cloud policy auth error)

arekay-nv · 2026-06-06T21:47:27Z

Review Council — Claude-only Code Review

Reviewed by: Claude (Codex failed — cloud workspace policy auth error) | Depth: standard (240 lines)

Found 2 issues across 2 files:

1 medium
1 low

#	File	Line	Severity	Category	Summary
1	`src/inference_endpoint/openai/completions_adapter.py`	65	medium	bug	`skip_special_tokens: False` in metadata dict never wired into `TextCompletionRequest` — silently dropped
2	`src/inference_endpoint/commands/benchmark/execute.py`	320	low	bug	FileNotFoundError handler uses `performance_cfgs[0].path` instead of the `perf_cfg` variable extracted on line 305 (outside diff hunk — not posted inline)

Issue 1 (inline): completions_adapter.py:65 — "skip_special_tokens": False is added to metadata and flows into query.data, but TextCompletionRequest has no such field and encode_query() never reads it. Either wire it through to the request or remove the entry.

Issue 2 (summary-only, line outside diff): execute.py:320 — the FileNotFoundError handler body still uses performance_cfgs[0].path instead of perf_cfg.path. This PR introduced perf_cfg = performance_cfgs[0] at line 305 to avoid repeated indexing, but the error message at line 320 was not updated. Minor consistency nit.

nvzhihanj · 2026-06-07T01:51:45Z

    prompt: text_input
  accuracy_config: null  # Accuracy evaluation settings
  multi_turn: null  # Multi-turn conversation configuration
+  model_params_override: null  # Per-dataset overrides for the top-level model_params (sparse — only the fields you want to override). Merged on top of BenchmarkConfig.model_params at dataset-load time. Useful for MLPerf-style runs where accuracy and performance use different output budgets in the same fleet, e.g. model_params_override: {max_new_tokens: 32768, streaming: 'on'}.


The name doesn't look right. Convention is we call the parameters of the model as model parameters, e.g. the one from: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/config.json. These are invariants unless we are debugging the model.

And the max OSL and streaming are generation config: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/generation_config.json (and also https://github.com/mlcommons/inference_results_v6.0/blob/main/closed/NVIDIA/code/deepseek-r1/tensorrt/generation_config.json which has more top_p, top_k). These are controlled by the client through parameters in the query

Right now it seems like we are hard-coding the generation config in the dataset (through transform) - ideally we should create generation config for each "benchmark" (because it's a per-run settings, not per-dataset). I would recommend that you add this to post-mortem and fix the hierarchy properly later.

If you want to WAR short-term, at least name it generation_configs_override and put TODO to fix them later

Agree. The name is coming from the source field which show up as the model parameters, which isn't quite right. Since it is a generation config, and so should be a property of the dataset. The reason i say it is a property of the dataset is that we can have a situation where we want different osl for each dataset (like we have repeats for different accuracy datasets). Tying it with the dataset will also enable us to generate dynamic values (distributions) over a dataset independenty.

nvzhihanj

Review Council — first-principles WAR design review

Reviewed by: Claude (Codex review timed out — xhigh reasoning latency exceeds the 600s cap even on this small diff) · Depth: thorough

Scope: is the per-dataset max-OSL WAR correctly designed for a clean later refactor, given the long-term goal of per-phase generation config? 8 net-new findings (none duplicate the existing maintainer / gemini comments); verdict + long-term plan in the summary comment. The HIGH streaming data-integrity bug was verified against the source.

nvzhihanj · 2026-06-07T05:27:38Z

Review Council — first-principles WAR design review

Reviewed by: Claude · Depth: thorough
Codex review timed out (its xhigh reasoning latency exceeds the 600s cap even on this 381-line diff); Claude-led. The HIGH streaming bug was verified against the source.

This is a focused assessment of whether the per-dataset max-OSL WAR is correctly designed for a clean later refactor, given the long-term goal of per-phase generation config. 8 net-new findings posted inline — none duplicate the existing maintainer / gemini comments.

Verdict on the WAR

It achieves the immediate need (accuracy vs performance differ, since they're separate dataset entries) but is not yet "correct" as a WAR — three structural issues:

Wrong axis. It's keyed to dataset identity, which cannot express the stated long-term goal ("different generation config for the same dataset across phases") — phases derive from datasets and duplicate (name,type) entries are rejected, so you'd have to duplicate the dataset entry.
Over-broad surface. It accepts all ModelParams keys, including three that silently misbehave per-dataset: name (→ ISL/OSL tokenized with the wrong tokenizer), streaming (→ aggregator ignores it; TTFT/TPOT unrecorded), osl_distribution (no consumer + shallow-merge degrades). The field's own documented example advertises streaming: 'on', which doesn't work.
Naming (the maintainer's point): model_params_override mislabels generation config as model parameters.

Minimal clean WAR: rename → generation_config_override; restrict accepted keys to the per-dataset-meaningful set (max_new_tokens + sampling params), rejecting name/streaming/osl_distribution; add # TODO(post-mortem): generation config is per-phase, not per-dataset; trim the template/docstring example accordingly.

Long-term: split ModelParams → ModelIdentity {name, tokenizer} (per-run invariant) + GenerationConfig {max_new_tokens, temperature, top_p, top_k, penalties, streaming, osl_distribution} (per-phase). Carry GenerationConfig on PhaseConfig; adapters' dataset_transforms read the phase's GenerationConfig, and the dataset becomes generation-agnostic. "Same dataset, different gen-config per phase" then falls out naturally (per-phase streaming additionally needs per-phase aggregator handling).

🔴 Must-fix (for the WAR to be correct)

#	File	Line	Cat	Summary
A	`config/schema.py`	317	data-integrity	Per-dataset `streaming` override reaches request rows but the single aggregator reads the global `streaming` (execute.py:185) → TTFT/TPOT silently unrecorded. The documented example advertises this broken key.
B	`config/schema.py`	309	design	Override bound to dataset identity is a dead-end for "same dataset, different gen-config per phase." WAR: rename + TODO. Long-term: per-phase `GenerationConfig`.
C	`config/schema.py`	329	design	Accepts all `ModelParams` keys incl. `name` → per-dataset `name` mismatches the global tokenizer/aggregator. Restrict to a per-dataset allowlist.

🟡 Should-fix

#	File	Line	Cat	Summary
D	`config/schema.py`	353	bug	(extends gemini) Shallow-merge of `osl_distribution` silently resets siblings and passes validation (all defaults), so the docstring's "catches bad overrides" is false; also `osl_distribution` has no consumer → overriding it is a no-op.
E	`tests/unit/commands/test_benchmark.py`	1144	testing	Gaps: nested/deep-merge, streaming end-to-end, multi-turn loader, and the completions adapter `max_tokens` key (tests only assert the chat `max_completion_tokens`, but the PR targets the completions path).
F	`config/templates/offline_template_full.yaml` (×3)	29	design	WAR baked into all three generated templates + the template-parse test → load-bearing / hard to undo; also advertises broken `streaming:'on'`. Rename now; trim the example.
G	`config/schema.py`	344	error-handling	Override value validation deferred to load-time (after setup writes report dir / config.yaml / probes tokenizer); move into `BenchmarkConfig._resolve_and_validate` to fail atomically at construction.

🔵 Consider

#	File	Line	Cat	Summary
H	`commands/benchmark/execute.py`	293	error-handling	Narrow `except Exception` → `(ValidationError, ValueError)` so a future bug isn't mislabeled "invalid model_params_override."

Through-line: A, C, D, F are all symptoms of the same root cause — generation config is modeled as part of ModelParams and attached to the dataset, so the override surface is both too wide (accepts identity/per-run keys) and on the wrong axis. Renaming + key-allowlisting makes the WAR honest now; the ModelIdentity / GenerationConfig split on PhaseConfig is the long-term fix.

Dedup: builds on — does not duplicate — the maintainer's naming/per-run comment and gemini's shallow-merge note (D explicitly extends it). The skip_special_tokens dead-code was already flagged by a prior review and is omitted here.

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

viraatc · 2026-06-09T02:48:08Z

    multi_turn: MultiTurnConfig | None = Field(
        None, description="Multi-turn conversation configuration"
    )
+    # TODO(post-mortem): generation config is per-phase (perf vs. accuracy),


arekay-nv requested a review from a team June 6, 2026 21:36

github-actions Bot requested a review from nvzhihanj June 6, 2026 21:36

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

Comment thread src/inference_endpoint/config/schema.py Outdated

Comment thread tests/unit/config/test_schema.py Outdated

arekay-nv commented Jun 6, 2026

View reviewed changes

Comment thread src/inference_endpoint/openai/completions_adapter.py Outdated

arekay-nv commented Jun 6, 2026

View reviewed changes

Comment thread src/inference_endpoint/openai/completions_adapter.py Outdated

nvzhihanj reviewed Jun 7, 2026

View reviewed changes

Comment thread src/inference_endpoint/config/schema.py Outdated

nvzhihanj requested changes Jun 7, 2026

View reviewed changes

nvzhihanj reviewed Jun 7, 2026

View reviewed changes

viraatc approved these changes Jun 8, 2026

View reviewed changes

arekay-nv requested a review from nvzhihanj June 8, 2026 23:16

arekay-nv added 2 commits June 8, 2026 18:20

Enable per dataset max-osl

fbfccd9

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

Address comments

553a6d2

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

arekay-nv force-pushed the arekay/per_dataset_maxosl branch from fa1ba98 to 553a6d2 Compare June 8, 2026 23:35

viraatc reviewed Jun 9, 2026

View reviewed changes

arekay-nv mentioned this pull request Jun 13, 2026

feat: per-dataset max_new_tokens override #356

Open

10 tasks

nvzhihanj marked this pull request as draft June 15, 2026 03:39

Conversation

arekay-nv commented Jun 6, 2026

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Claude-only review (Codex failed: cloud policy auth error)

Uh oh!

Uh oh!

Uh oh!

arekay-nv commented Jun 6, 2026

Review Council — Claude-only Code Review

Uh oh!

nvzhihanj Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arekay-nv Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nvzhihanj left a comment

Choose a reason for hiding this comment

Review Council — first-principles WAR design review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nvzhihanj commented Jun 7, 2026

Review Council — first-principles WAR design review

Verdict on the WAR

🔴 Must-fix (for the WAR to be correct)

🟡 Should-fix

🔵 Consider

Uh oh!

viraatc Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Jun 6, 2026 •

edited

Loading

nvzhihanj Jun 7, 2026 •

edited

Loading