Skip to content

fix: remove multiturn precompute#349

Open
tianmu-li wants to merge 11 commits into
mlcommons:mainfrom
tianmu-li:fix/remove_multiturn_precompute
Open

fix: remove multiturn precompute#349
tianmu-li wants to merge 11 commits into
mlcommons:mainfrom
tianmu-li:fix/remove_multiturn_precompute

Conversation

@tianmu-li

Copy link
Copy Markdown
Collaborator

What does this PR do?

Move ISL pre-compute to a separate file since ISL is not consumed during run. Also address the O(T^2) concerns in #336

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

tianmu-li and others added 7 commits June 9, 2026 03:35
…per, parallelize history building

- Remove _precompute_isl_for_multi_turn() from execute.py and its call
  site in setup_benchmark(); ISL is not consumed at runtime so the
  tokenizer load + full-dataset scan at startup was wasted time
- Add dataset_manager/multi_turn_isl.py as a self-contained offline
  helper exposing precompute_isl_for_multi_turn() for ISL distribution
  analysis outside the benchmark hot path
- Parallelize MultiTurnDataset._build_metadata() across conversations
  using ThreadPoolExecutor; turns within each conversation remain
  sequential (turn N depends on prior turns), but conversations are
  fully independent and can build their message history concurrently
- Delete tests scoped to the removed pre-compute call path
- Add unit tests for the new standalone helper

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ild_metadata

Each client turn previously rescanned all prior rows to reconstruct the
message history from scratch. Replace with a single pass over sorted_group
carrying a running history list: each row is formatted once, client turns
snapshot (history + current_msgs), non-client rows extend history.

Remove ThreadPoolExecutor parallelization added alongside the O(N²) code —
with O(N) work per conversation the threading overhead dominates and GIL
contention on pure-Python dict operations means threads don't run in parallel.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… reporting

Running the module directly loads a JSONL dataset, tokenizes all turns, and
prints min/mean/p50/p99/max ISL stats. Also exposes isl_distribution() as a
callable for programmatic use.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The ISL helper has no library callers — nothing in production imports it.
Move to scripts/ to signal it is a developer tool, not a library module.
Remove the unit tests since the script is only run directly.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ores

Load tokenizer once, share it across threads (fast tokenizer is thread-safe),
fan out apply_chat_template calls with ThreadPoolExecutor(os.cpu_count()).
Removes per-thread tokenizer loading which was issuing one HF Hub HTTP
request per thread.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fix os.cpu_count() None guard in ISL precompute script
- Remove history-narrating tail from O(N) pass comment
- Drop stale thread-safety claim from _build_conversation_metadata docstring

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@tianmu-li tianmu-li requested review from a team and Copilot June 9, 2026 04:24
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new offline ISL (Input Sequence Length) computation script for multi-turn datasets, refactors the dataset manager by extracting conversation metadata building into a helper function, and removes the old synchronous pre-computation logic and its associated tests. The review feedback highlights a few key areas for improvement: addressing an unreleased lock anti-pattern in the new script's tokenization error handling, resolving a type mismatch where a pandas Series is passed instead of a dictionary, and enhancing type safety by specifying element types in tuple type hints.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread scripts/agentic_inference_isl_precompute.py
Comment thread scripts/agentic_inference_isl_precompute.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Removes runtime ISL (Input Sequence Length) precompute from the benchmark execution path (since it isn’t consumed during normal runs), introduces an offline script for computing ISL distributions on multi-turn datasets, and refactors MultiTurnDataset metadata construction to reduce the prior O(T²) dataframe-slicing overhead.

Changes:

  • Refactor multi-turn metadata construction into a single-pass builder to avoid repeated per-turn dataframe filtering.
  • Remove runtime multi-turn ISL precompute from setup_benchmark() and delete its now-obsolete unit/integration tests.
  • Add scripts/multi_turn_isl.py to compute ISL token counts and print distribution stats offline.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/commands/test_precompute_isl.py Removed unit tests for the deleted runtime ISL precompute helper.
tests/integration/test_multi_turn_metrics.py Removed integration tests that depended on runtime ISL precompute behavior.
src/inference_endpoint/dataset_manager/multi_turn_dataset.py Introduced a single-pass per-conversation metadata builder to reduce O(T²) overhead from repeated slicing.
src/inference_endpoint/commands/benchmark/execute.py Removed runtime ISL precompute and related imports/call sites.
scripts/multi_turn_isl.py Added an offline, parallel ISL precompute + distribution reporting utility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated
tianmu-li and others added 2 commits June 9, 2026 04:39
- Replace unreleased-lock anti-pattern with bool flag + lock context manager
- Narrow dict[tuple, ...] annotations to dict[tuple[str, int], ...] throughout
- Pass row.to_dict() to _expand_tool_results to match its dict signature
- Replace history = history + list with history.extend() to avoid O(n) allocs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…etadata

Broaden _expand_tool_results signature to dict | pd.Series so it accepts
the pd.Series yielded by iterrows() directly. Removes an O(1) but
per-row dict copy that was added to satisfy the type annotation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,174 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend changing the file name to agentic_benchmark_isl_precompute.py to be more explicit (and rename future files to agentic to stay consistent)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ds.load()
_precompute_isl(ds, args.tokenizer)

stats = _isl_distribution(ds)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to understand, the precompute ISL is strictly for stats purposes, but not for actual runs?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. This will not be executed during benchmark.

Copilot AI review requested due to automatic review settings June 9, 2026 23:36
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread scripts/agentic_inference_isl_precompute.py
Comment on lines +49 to +73
try:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
except Exception:
logger.exception("Failed to load tokenizer %s", tokenizer_name)
return

first_failure_logged = False
first_failure_lock = threading.Lock()

def _tokenize_sample(sample: dict) -> list[int] | None:
try:
normalized_messages = []
for msg in sample["messages"]:
if msg.get("tool_calls"):
msg = {
**msg,
"tool_calls": _normalize_tool_calls_for_template(
msg["tool_calls"]
),
}
normalized_messages.append(msg)
tools = sample.get("tools")
raw = tokenizer.apply_chat_template(
normalized_messages,
tools=tools if tools else None,
Comment on lines 350 to 353
# Datasets
dataloader, accuracy_datasets, eval_configs = _load_datasets(config, report_dir)

if isinstance(dataloader, MultiTurnDataset) and tokenizer_name is not None:
logger.info("Pre-computing ISL token counts for multi-turn dataset…")
_precompute_isl_for_multi_turn(dataloader, tokenizer_name)

# Setup runtime settings using factory method
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

@nvzhihanj nvzhihanj left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — focused review: multi-turn conversation building

Reviewed by: Claude (Codex cross-checked the precompute-removal direction) · scope: the _build_conversation_metadata rewrite in multi_turn_dataset.py.

Verdict: the rewrite is correct and a genuine improvement. I diffed it against the merge-base: it is semantically equivalent to the old per-turn rebuild (same field extraction for history rows — incl. the pre-existing tool_calls/reasoning_content; the current turn is always a user/tool row where those don't apply or go through _expand_tool_results), and it changes O(T²) time → O(T) time. Turn-key collisions are impossible (the validator enforces turns are exactly consecutive 1..N); salt only touches system_content. Two robustness caveats are posted inline.

Confirmations (as asked):

  • Precompute ISL is stats-only. input_tokens is consumed only by the ISL metric (session.py:247) — never for scheduling, max-tokens, or sample counts; for multi-turn it's set to None at issue, so it never even reaches the metric. Removing it loses no run-time behavior.
  • Output metrics are OSL/timing-based. OSL = token_count(output), TPOT = output-duration/output-tokens, TTFT = timing — all independent of ISL.

Inline caveats:

  1. (design) Shared-dict aliasing — the O(T) speedup shares message dicts across all prefix snapshots; safe today but relies on an undocumented immutability invariant (the old code deep-copied). Document + test.
  2. (performance) O(T²) memory of pre_built_messages_by_key is unchanged (only time was fixed), and is built-then-discarded in live/agentic mode — skip it when use_dataset_history=False.

Aside (low, pre-existing, not posted inline): prior-turn assistant reasoning_content is replayed in the prefix; real agentic deployments usually strip prior CoT — a prompt-fidelity question, out of scope for this PR.

# Snapshot: history holds everything before this turn; create new lists
# so stored snapshots are not mutated by later history extensions.
pre_built_messages_by_key[(str_conv_id, t_n)] = (
list(history) + current_turn_msgs

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · design — shared-dict aliasing: the O(T) speedup relies on an undocumented immutability invariant.

Each row's message dict is now formatted once and the same dict object is shared across history, every later pre_built_messages_by_key prefix snapshot, and current_turn_messages_by_key. (The old per-turn rebuild deep-copied each message into every snapshot, so snapshots were independent.) The comment here guards only the list aliasing (list(history) makes a fresh list); it does not guard the dict aliasing. It's safe today — transforms run on the dataframe before _build_metadata, salt only touches system_content, and the runtime paths only read/append — but a future per-turn mutation (a transform that edits a message in place, per-turn salt, an adapter mutating query.data['messages'][i]) would silently corrupt every turn that shares that dict. Suggest: document the "messages are immutable after build" invariant, and add a regression test that mutates one turn's message and asserts the other turns' prefixes are unchanged.

current_turn_msgs: list[dict] = row_msgs
# Snapshot: history holds everything before this turn; create new lists
# so stored snapshots are not mutated by later history extensions.
pre_built_messages_by_key[(str_conv_id, t_n)] = (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium · performance — O(T²) memory remains, and is wasted in live/agentic mode.

This PR fixes the O(T²) time (history is built in one incremental pass), but pre_built_messages_by_key still stores every client turn's full prefix → O(T²) message-refs per conversation. Two points:

  1. The O(T²) memory the change set out to address (fix: parallel precompute for multi-turn datasets #336) is not removed — only the time. For long agentic conversations (hundreds of turns) the stored prefixes still grow quadratically.
  2. In live-history (agentic) mode it's pure waste: multi_turn_strategy.py:294 overrides the query with runtime live_messages = state.message_history.copy() + current_turn_messages, so the pre_built[key] that load() builds and attaches to each sample (sample["messages"] = pre_built[key]) is discarded at issue. Only replay mode (use_dataset_history=True) consumes pre_built.

Suggest: build/attach pre_built_messages_by_key only when use_dataset_history=True (live mode needs just current_turn_messages_by_key); and note in the PR that the O(T²) memory of storing all prefixes is unchanged (build prefixes lazily at issue if it ever matters for long conversations).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants