fix: remove multiturn precompute by tianmu-li · Pull Request #349 · mlcommons/endpoints

tianmu-li · 2026-06-09T04:24:56Z

What does this PR do?

Move ISL pre-compute to a separate file since ISL is not consumed during run. Also address the O(T^2) concerns in #336

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

…per, parallelize history building - Remove _precompute_isl_for_multi_turn() from execute.py and its call site in setup_benchmark(); ISL is not consumed at runtime so the tokenizer load + full-dataset scan at startup was wasted time - Add dataset_manager/multi_turn_isl.py as a self-contained offline helper exposing precompute_isl_for_multi_turn() for ISL distribution analysis outside the benchmark hot path - Parallelize MultiTurnDataset._build_metadata() across conversations using ThreadPoolExecutor; turns within each conversation remain sequential (turn N depends on prior turns), but conversations are fully independent and can build their message history concurrently - Delete tests scoped to the removed pre-compute call path - Add unit tests for the new standalone helper Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…ild_metadata Each client turn previously rescanned all prior rows to reconstruct the message history from scratch. Replace with a single pass over sorted_group carrying a running history list: each row is formatted once, client turns snapshot (history + current_msgs), non-client rows extend history. Remove ThreadPoolExecutor parallelization added alongside the O(N²) code — with O(N) work per conversation the threading overhead dominates and GIL contention on pure-Python dict operations means threads don't run in parallel. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

… reporting Running the module directly loads a JSONL dataset, tokenizes all turns, and prints min/mean/p50/p99/max ISL stats. Also exposes isl_distribution() as a callable for programmatic use. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The ISL helper has no library callers — nothing in production imports it. Move to scripts/ to signal it is a developer tool, not a library module. Remove the unit tests since the script is only run directly. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…ores Load tokenizer once, share it across threads (fast tokenizer is thread-safe), fan out apply_chat_template calls with ThreadPoolExecutor(os.cpu_count()). Removes per-thread tokenizer loading which was issuing one HF Hub HTTP request per thread. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Fix os.cpu_count() None guard in ISL precompute script - Remove history-narrating tail from O(N) pass comment - Drop stale thread-safety claim from _build_conversation_metadata docstring Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-09T04:25:05Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces a new offline ISL (Input Sequence Length) computation script for multi-turn datasets, refactors the dataset manager by extracting conversation metadata building into a helper function, and removes the old synchronous pre-computation logic and its associated tests. The review feedback highlights a few key areas for improvement: addressing an unreleased lock anti-pattern in the new script's tokenization error handling, resolving a type mismatch where a pandas Series is passed instead of a dictionary, and enhancing type safety by specifying element types in tuple type hints.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Copilot

Pull request overview

Removes runtime ISL (Input Sequence Length) precompute from the benchmark execution path (since it isn’t consumed during normal runs), introduces an offline script for computing ISL distributions on multi-turn datasets, and refactors MultiTurnDataset metadata construction to reduce the prior O(T²) dataframe-slicing overhead.

Changes:

Refactor multi-turn metadata construction into a single-pass builder to avoid repeated per-turn dataframe filtering.
Remove runtime multi-turn ISL precompute from setup_benchmark() and delete its now-obsolete unit/integration tests.
Add scripts/multi_turn_isl.py to compute ISL token counts and print distribution stats offline.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/unit/commands/test_precompute_isl.py	Removed unit tests for the deleted runtime ISL precompute helper.
tests/integration/test_multi_turn_metrics.py	Removed integration tests that depended on runtime ISL precompute behavior.
src/inference_endpoint/dataset_manager/multi_turn_dataset.py	Introduced a single-pass per-conversation metadata builder to reduce O(T²) overhead from repeated slicing.
src/inference_endpoint/commands/benchmark/execute.py	Removed runtime ISL precompute and related imports/call sites.
scripts/multi_turn_isl.py	Added an offline, parallel ISL precompute + distribution reporting utility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Replace unreleased-lock anti-pattern with bool flag + lock context manager - Narrow dict[tuple, ...] annotations to dict[tuple[str, int], ...] throughout - Pass row.to_dict() to _expand_tool_results to match its dict signature - Replace history = history + list with history.extend() to avoid O(n) allocs Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…etadata Broaden _expand_tool_results signature to dict | pd.Series so it accepts the pd.Series yielded by iterrows() directly. Removes an O(1) but per-row dict copy that was added to satisfy the type annotation. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

nvzhihanj · 2026-06-09T23:33:08Z

@@ -0,0 +1,174 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Recommend changing the file name to agentic_benchmark_isl_precompute.py to be more explicit (and rename future files to agentic to stay consistent)

nvzhihanj · 2026-06-09T23:33:37Z

+    ds.load()
+    _precompute_isl(ds, args.tokenizer)
+
+    stats = _isl_distribution(ds)


Just to understand, the precompute ISL is strictly for stats purposes, but not for actual runs?

Correct. This will not be executed during benchmark.

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

+    try:
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+    except Exception:
+        logger.exception("Failed to load tokenizer %s", tokenizer_name)
+        return
+
+    first_failure_logged = False
+    first_failure_lock = threading.Lock()
+
+    def _tokenize_sample(sample: dict) -> list[int] | None:
+        try:
+            normalized_messages = []
+            for msg in sample["messages"]:
+                if msg.get("tool_calls"):
+                    msg = {
+                        **msg,
+                        "tool_calls": _normalize_tool_calls_for_template(
+                            msg["tool_calls"]
+                        ),
+                    }
+                normalized_messages.append(msg)
+            tools = sample.get("tools")
+            raw = tokenizer.apply_chat_template(
+                normalized_messages,
+                tools=tools if tools else None,


    # Datasets
    dataloader, accuracy_datasets, eval_configs = _load_datasets(config, report_dir)

-    if isinstance(dataloader, MultiTurnDataset) and tokenizer_name is not None:
-        logger.info("Pre-computing ISL token counts for multi-turn dataset…")
-        _precompute_isl_for_multi_turn(dataloader, tokenizer_name)
-
    # Setup runtime settings using factory method


Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

nvzhihanj

Review Council — focused review: multi-turn conversation building

Reviewed by: Claude (Codex cross-checked the precompute-removal direction) · scope: the _build_conversation_metadata rewrite in multi_turn_dataset.py.

Verdict: the rewrite is correct and a genuine improvement. I diffed it against the merge-base: it is semantically equivalent to the old per-turn rebuild (same field extraction for history rows — incl. the pre-existing tool_calls/reasoning_content; the current turn is always a user/tool row where those don't apply or go through _expand_tool_results), and it changes O(T²) time → O(T) time. Turn-key collisions are impossible (the validator enforces turns are exactly consecutive 1..N); salt only touches system_content. Two robustness caveats are posted inline.

Confirmations (as asked):

✅ Precompute ISL is stats-only. input_tokens is consumed only by the ISL metric (session.py:247) — never for scheduling, max-tokens, or sample counts; for multi-turn it's set to None at issue, so it never even reaches the metric. Removing it loses no run-time behavior.
✅ Output metrics are OSL/timing-based. OSL = token_count(output), TPOT = output-duration/output-tokens, TTFT = timing — all independent of ISL.

Inline caveats:

(design) Shared-dict aliasing — the O(T) speedup shares message dicts across all prefix snapshots; safe today but relies on an undocumented immutability invariant (the old code deep-copied). Document + test.
(performance) O(T²) memory of pre_built_messages_by_key is unchanged (only time was fixed), and is built-then-discarded in live/agentic mode — skip it when use_dataset_history=False.

Aside (low, pre-existing, not posted inline): prior-turn assistant reasoning_content is replayed in the prefix; real agentic deployments usually strip prior CoT — a prompt-fidelity question, out of scope for this PR.

nvzhihanj · 2026-06-09T23:58:55Z

+            # Snapshot: history holds everything before this turn; create new lists
+            # so stored snapshots are not mutated by later history extensions.
+            pre_built_messages_by_key[(str_conv_id, t_n)] = (
+                list(history) + current_turn_msgs


[Claude] medium · design — shared-dict aliasing: the O(T) speedup relies on an undocumented immutability invariant.

Each row's message dict is now formatted once and the same dict object is shared across history, every later pre_built_messages_by_key prefix snapshot, and current_turn_messages_by_key. (The old per-turn rebuild deep-copied each message into every snapshot, so snapshots were independent.) The comment here guards only the list aliasing (list(history) makes a fresh list); it does not guard the dict aliasing. It's safe today — transforms run on the dataframe before _build_metadata, salt only touches system_content, and the runtime paths only read/append — but a future per-turn mutation (a transform that edits a message in place, per-turn salt, an adapter mutating query.data['messages'][i]) would silently corrupt every turn that shares that dict. Suggest: document the "messages are immutable after build" invariant, and add a regression test that mutates one turn's message and asserts the other turns' prefixes are unchanged.

nvzhihanj · 2026-06-09T23:58:55Z

+            current_turn_msgs: list[dict] = row_msgs
+            # Snapshot: history holds everything before this turn; create new lists
+            # so stored snapshots are not mutated by later history extensions.
+            pre_built_messages_by_key[(str_conv_id, t_n)] = (


[Claude] medium · performance — O(T²) memory remains, and is wasted in live/agentic mode.

This PR fixes the O(T²) time (history is built in one incremental pass), but pre_built_messages_by_key still stores every client turn's full prefix → O(T²) message-refs per conversation. Two points:

The O(T²) memory the change set out to address (fix: parallel precompute for multi-turn datasets #336) is not removed — only the time. For long agentic conversations (hundreds of turns) the stored prefixes still grow quadratically.

In live-history (agentic) mode it's pure waste: multi_turn_strategy.py:294 overrides the query with runtime live_messages = state.message_history.copy() + current_turn_messages, so the pre_built[key] that load() builds and attaches to each sample (sample["messages"] = pre_built[key]) is discarded at issue. Only replay mode (use_dataset_history=True) consumes pre_built.

Suggest: build/attach pre_built_messages_by_key only when use_dataset_history=True (live mode needs just current_turn_messages_by_key); and note in the PR that the O(T²) memory of storing all prefixes is unchanged (build prefixes lazily at issue if it ever matters for long conversations).

tianmu-li and others added 7 commits June 9, 2026 03:35

chore: cap ISL precompute threads at 32

9da5cd2

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

tianmu-li requested review from a team and Copilot June 9, 2026 04:24

Copilot started reviewing on behalf of tianmu-li June 9, 2026 04:25 View session

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py Outdated

tianmu-li and others added 2 commits June 9, 2026 04:39

YibaiMeng mentioned this pull request Jun 9, 2026

feat: show progress bar during multi-turn ISL pre-compute #329

Closed

10 tasks

nvzhihanj reviewed Jun 9, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 9, 2026 23:36

Copilot started reviewing on behalf of tianmu-li June 9, 2026 23:36 View session

Rename script

478208c

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Doc consistency

60459d9

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

nvzhihanj reviewed Jun 9, 2026

View reviewed changes

viraatc approved these changes Jun 10, 2026

View reviewed changes

		@@ -0,0 +1,174 @@
		# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Conversation

tianmu-li commented Jun 9, 2026

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

nvzhihanj Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

tianmu-li Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

nvzhihanj Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

tianmu-li Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

nvzhihanj left a comment

Choose a reason for hiding this comment

Review Council — focused review: multi-turn conversation building

Uh oh!

nvzhihanj Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

nvzhihanj Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented Jun 9, 2026 •

edited

Loading