feat: Enable per dataset max-osl#344
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces a model_params_override field to the dataset configuration, allowing per-dataset overrides of the top-level model parameters. This is particularly useful for scenarios like MLPerf-style runs where accuracy and performance datasets require different configurations. The changes include validation of override keys, merging logic in effective_model_params, template updates, and comprehensive unit tests. Feedback from the reviewer highlights that the current implementation of effective_model_params performs a shallow merge, which could inadvertently overwrite entire nested dictionaries (such as osl_distribution). The reviewer suggests implementing a deep merge helper to preserve other nested configuration fields and adding corresponding unit tests to verify this behavior.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Claude-only review (Codex failed: cloud policy auth error)
Review Council — Claude-only Code ReviewReviewed by: Claude (Codex failed — cloud workspace policy auth error) | Depth: standard (240 lines) Found 2 issues across 2 files:
Issue 1 (inline): Issue 2 (summary-only, line outside diff): |
| prompt: text_input | ||
| accuracy_config: null # Accuracy evaluation settings | ||
| multi_turn: null # Multi-turn conversation configuration | ||
| model_params_override: null # Per-dataset overrides for the top-level model_params (sparse — only the fields you want to override). Merged on top of BenchmarkConfig.model_params at dataset-load time. Useful for MLPerf-style runs where accuracy and performance use different output budgets in the same fleet, e.g. model_params_override: {max_new_tokens: 32768, streaming: 'on'}. |
There was a problem hiding this comment.
The name doesn't look right. Convention is we call the parameters of the model as model parameters, e.g. the one from: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/config.json. These are invariants unless we are debugging the model.
And the max OSL and streaming are generation config: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/generation_config.json (and also https://github.com/mlcommons/inference_results_v6.0/blob/main/closed/NVIDIA/code/deepseek-r1/tensorrt/generation_config.json which has more top_p, top_k). These are controlled by the client through parameters in the query
Right now it seems like we are hard-coding the generation config in the dataset (through transform) - ideally we should create generation config for each "benchmark" (because it's a per-run settings, not per-dataset). I would recommend that you add this to post-mortem and fix the hierarchy properly later.
If you want to WAR short-term, at least name it generation_configs_override and put TODO to fix them later
There was a problem hiding this comment.
Agree. The name is coming from the source field which show up as the model parameters, which isn't quite right. Since it is a generation config, and so should be a property of the dataset. The reason i say it is a property of the dataset is that we can have a situation where we want different osl for each dataset (like we have repeats for different accuracy datasets). Tying it with the dataset will also enable us to generate dynamic values (distributions) over a dataset independenty.
nvzhihanj
left a comment
There was a problem hiding this comment.
Review Council — first-principles WAR design review
Reviewed by: Claude (Codex review timed out — xhigh reasoning latency exceeds the 600s cap even on this small diff) · Depth: thorough
Scope: is the per-dataset max-OSL WAR correctly designed for a clean later refactor, given the long-term goal of per-phase generation config? 8 net-new findings (none duplicate the existing maintainer / gemini comments); verdict + long-term plan in the summary comment. The HIGH streaming data-integrity bug was verified against the source.
Review Council — first-principles WAR design reviewReviewed by: Claude · Depth: thorough This is a focused assessment of whether the per-dataset max-OSL WAR is correctly designed for a clean later refactor, given the long-term goal of per-phase generation config. 8 net-new findings posted inline — none duplicate the existing maintainer / gemini comments. Verdict on the WARIt achieves the immediate need (accuracy vs performance differ, since they're separate dataset entries) but is not yet "correct" as a WAR — three structural issues:
Minimal clean WAR: rename → Long-term: split 🔴 Must-fix (for the WAR to be correct)
🟡 Should-fix
🔵 Consider
Through-line: A, C, D, F are all symptoms of the same root cause — generation config is modeled as part of Dedup: builds on — does not duplicate — the maintainer's naming/per-run comment and gemini's shallow-merge note (D explicitly extends it). The |
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
fa1ba98 to
553a6d2
Compare
| multi_turn: MultiTurnConfig | None = Field( | ||
| None, description="Multi-turn conversation configuration" | ||
| ) | ||
| # TODO(post-mortem): generation config is per-phase (perf vs. accuracy), |
What does this PR do?
Enables the max-osl to be specified per-dataset. This enables running perf/accuracy runs with different OSL in the same benchmark.
Type of change
Related issues
Testing
Checklist