fix: remove multiturn precompute#349
Conversation
…per, parallelize history building - Remove _precompute_isl_for_multi_turn() from execute.py and its call site in setup_benchmark(); ISL is not consumed at runtime so the tokenizer load + full-dataset scan at startup was wasted time - Add dataset_manager/multi_turn_isl.py as a self-contained offline helper exposing precompute_isl_for_multi_turn() for ISL distribution analysis outside the benchmark hot path - Parallelize MultiTurnDataset._build_metadata() across conversations using ThreadPoolExecutor; turns within each conversation remain sequential (turn N depends on prior turns), but conversations are fully independent and can build their message history concurrently - Delete tests scoped to the removed pre-compute call path - Add unit tests for the new standalone helper Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ild_metadata Each client turn previously rescanned all prior rows to reconstruct the message history from scratch. Replace with a single pass over sorted_group carrying a running history list: each row is formatted once, client turns snapshot (history + current_msgs), non-client rows extend history. Remove ThreadPoolExecutor parallelization added alongside the O(N²) code — with O(N) work per conversation the threading overhead dominates and GIL contention on pure-Python dict operations means threads don't run in parallel. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… reporting Running the module directly loads a JSONL dataset, tokenizes all turns, and prints min/mean/p50/p99/max ISL stats. Also exposes isl_distribution() as a callable for programmatic use. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The ISL helper has no library callers — nothing in production imports it. Move to scripts/ to signal it is a developer tool, not a library module. Remove the unit tests since the script is only run directly. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ores Load tokenizer once, share it across threads (fast tokenizer is thread-safe), fan out apply_chat_template calls with ThreadPoolExecutor(os.cpu_count()). Removes per-thread tokenizer loading which was issuing one HF Hub HTTP request per thread. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fix os.cpu_count() None guard in ISL precompute script - Remove history-narrating tail from O(N) pass comment - Drop stale thread-safety claim from _build_conversation_metadata docstring Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces a new offline ISL (Input Sequence Length) computation script for multi-turn datasets, refactors the dataset manager by extracting conversation metadata building into a helper function, and removes the old synchronous pre-computation logic and its associated tests. The review feedback highlights a few key areas for improvement: addressing an unreleased lock anti-pattern in the new script's tokenization error handling, resolving a type mismatch where a pandas Series is passed instead of a dictionary, and enhancing type safety by specifying element types in tuple type hints.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Pull request overview
Removes runtime ISL (Input Sequence Length) precompute from the benchmark execution path (since it isn’t consumed during normal runs), introduces an offline script for computing ISL distributions on multi-turn datasets, and refactors MultiTurnDataset metadata construction to reduce the prior O(T²) dataframe-slicing overhead.
Changes:
- Refactor multi-turn metadata construction into a single-pass builder to avoid repeated per-turn dataframe filtering.
- Remove runtime multi-turn ISL precompute from
setup_benchmark()and delete its now-obsolete unit/integration tests. - Add
scripts/multi_turn_isl.pyto compute ISL token counts and print distribution stats offline.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/commands/test_precompute_isl.py | Removed unit tests for the deleted runtime ISL precompute helper. |
| tests/integration/test_multi_turn_metrics.py | Removed integration tests that depended on runtime ISL precompute behavior. |
| src/inference_endpoint/dataset_manager/multi_turn_dataset.py | Introduced a single-pass per-conversation metadata builder to reduce O(T²) overhead from repeated slicing. |
| src/inference_endpoint/commands/benchmark/execute.py | Removed runtime ISL precompute and related imports/call sites. |
| scripts/multi_turn_isl.py | Added an offline, parallel ISL precompute + distribution reporting utility. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Replace unreleased-lock anti-pattern with bool flag + lock context manager - Narrow dict[tuple, ...] annotations to dict[tuple[str, int], ...] throughout - Pass row.to_dict() to _expand_tool_results to match its dict signature - Replace history = history + list with history.extend() to avoid O(n) allocs Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…etadata Broaden _expand_tool_results signature to dict | pd.Series so it accepts the pd.Series yielded by iterrows() directly. Removes an O(1) but per-row dict copy that was added to satisfy the type annotation. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
| @@ -0,0 +1,174 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
Recommend changing the file name to agentic_benchmark_isl_precompute.py to be more explicit (and rename future files to agentic to stay consistent)
| ds.load() | ||
| _precompute_isl(ds, args.tokenizer) | ||
|
|
||
| stats = _isl_distribution(ds) |
There was a problem hiding this comment.
Just to understand, the precompute ISL is strictly for stats purposes, but not for actual runs?
There was a problem hiding this comment.
Correct. This will not be executed during benchmark.
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
| try: | ||
| tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) | ||
| except Exception: | ||
| logger.exception("Failed to load tokenizer %s", tokenizer_name) | ||
| return | ||
|
|
||
| first_failure_logged = False | ||
| first_failure_lock = threading.Lock() | ||
|
|
||
| def _tokenize_sample(sample: dict) -> list[int] | None: | ||
| try: | ||
| normalized_messages = [] | ||
| for msg in sample["messages"]: | ||
| if msg.get("tool_calls"): | ||
| msg = { | ||
| **msg, | ||
| "tool_calls": _normalize_tool_calls_for_template( | ||
| msg["tool_calls"] | ||
| ), | ||
| } | ||
| normalized_messages.append(msg) | ||
| tools = sample.get("tools") | ||
| raw = tokenizer.apply_chat_template( | ||
| normalized_messages, | ||
| tools=tools if tools else None, |
| # Datasets | ||
| dataloader, accuracy_datasets, eval_configs = _load_datasets(config, report_dir) | ||
|
|
||
| if isinstance(dataloader, MultiTurnDataset) and tokenizer_name is not None: | ||
| logger.info("Pre-computing ISL token counts for multi-turn dataset…") | ||
| _precompute_isl_for_multi_turn(dataloader, tokenizer_name) | ||
|
|
||
| # Setup runtime settings using factory method |
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
nvzhihanj
left a comment
There was a problem hiding this comment.
Review Council — focused review: multi-turn conversation building
Reviewed by: Claude (Codex cross-checked the precompute-removal direction) · scope: the _build_conversation_metadata rewrite in multi_turn_dataset.py.
Verdict: the rewrite is correct and a genuine improvement. I diffed it against the merge-base: it is semantically equivalent to the old per-turn rebuild (same field extraction for history rows — incl. the pre-existing tool_calls/reasoning_content; the current turn is always a user/tool row where those don't apply or go through _expand_tool_results), and it changes O(T²) time → O(T) time. Turn-key collisions are impossible (the validator enforces turns are exactly consecutive 1..N); salt only touches system_content. Two robustness caveats are posted inline.
Confirmations (as asked):
- ✅ Precompute ISL is stats-only.
input_tokensis consumed only by the ISL metric (session.py:247) — never for scheduling, max-tokens, or sample counts; for multi-turn it's set toNoneat issue, so it never even reaches the metric. Removing it loses no run-time behavior. - ✅ Output metrics are OSL/timing-based. OSL =
token_count(output), TPOT = output-duration/output-tokens, TTFT = timing — all independent of ISL.
Inline caveats:
- (design) Shared-dict aliasing — the O(T) speedup shares message dicts across all prefix snapshots; safe today but relies on an undocumented immutability invariant (the old code deep-copied). Document + test.
- (performance) O(T²) memory of
pre_built_messages_by_keyis unchanged (only time was fixed), and is built-then-discarded in live/agentic mode — skip it whenuse_dataset_history=False.
Aside (low, pre-existing, not posted inline): prior-turn assistant reasoning_content is replayed in the prefix; real agentic deployments usually strip prior CoT — a prompt-fidelity question, out of scope for this PR.
| # Snapshot: history holds everything before this turn; create new lists | ||
| # so stored snapshots are not mutated by later history extensions. | ||
| pre_built_messages_by_key[(str_conv_id, t_n)] = ( | ||
| list(history) + current_turn_msgs |
There was a problem hiding this comment.
[Claude] medium · design — shared-dict aliasing: the O(T) speedup relies on an undocumented immutability invariant.
Each row's message dict is now formatted once and the same dict object is shared across history, every later pre_built_messages_by_key prefix snapshot, and current_turn_messages_by_key. (The old per-turn rebuild deep-copied each message into every snapshot, so snapshots were independent.) The comment here guards only the list aliasing (list(history) makes a fresh list); it does not guard the dict aliasing. It's safe today — transforms run on the dataframe before _build_metadata, salt only touches system_content, and the runtime paths only read/append — but a future per-turn mutation (a transform that edits a message in place, per-turn salt, an adapter mutating query.data['messages'][i]) would silently corrupt every turn that shares that dict. Suggest: document the "messages are immutable after build" invariant, and add a regression test that mutates one turn's message and asserts the other turns' prefixes are unchanged.
| current_turn_msgs: list[dict] = row_msgs | ||
| # Snapshot: history holds everything before this turn; create new lists | ||
| # so stored snapshots are not mutated by later history extensions. | ||
| pre_built_messages_by_key[(str_conv_id, t_n)] = ( |
There was a problem hiding this comment.
[Claude] medium · performance — O(T²) memory remains, and is wasted in live/agentic mode.
This PR fixes the O(T²) time (history is built in one incremental pass), but pre_built_messages_by_key still stores every client turn's full prefix → O(T²) message-refs per conversation. Two points:
- The O(T²) memory the change set out to address (fix: parallel precompute for multi-turn datasets #336) is not removed — only the time. For long agentic conversations (hundreds of turns) the stored prefixes still grow quadratically.
- In live-history (agentic) mode it's pure waste:
multi_turn_strategy.py:294overrides the query with runtimelive_messages = state.message_history.copy() + current_turn_messages, so thepre_built[key]thatload()builds and attaches to each sample (sample["messages"] = pre_built[key]) is discarded at issue. Only replay mode (use_dataset_history=True) consumespre_built.
Suggest: build/attach pre_built_messages_by_key only when use_dataset_history=True (live mode needs just current_turn_messages_by_key); and note in the PR that the O(T²) memory of storing all prefixes is unchanged (build prefixes lazily at issue if it ever matters for long conversations).
What does this PR do?
Move ISL pre-compute to a separate file since ISL is not consumed during run. Also address the O(T^2) concerns in #336
Type of change
Related issues
Testing
Checklist