mlcommons · hvagadia · Jun 3, 2026 · Jun 11, 2026
@@ -51,36 +51,125 @@ clients; set `model_params.name` in the YAML to the same value.
 The runnable config is
 `examples/09_MultiTurn/kimi_agentic_benchmark.yaml`.
 
-Key fields:
-
-- `type: online`: runs the benchmark through the online scheduler.
-- `model_params.name`: model name sent in each OpenAI request. Keep it aligned
-  with the served model name.
-- `model_params.temperature`, `top_p`, `max_new_tokens`: sampling settings sent
-  to the server. `max_new_tokens` is large because agent turns can be long.
-- `model_params.chat_template_kwargs`: Kimi-specific template options for
-  reasoning preservation.
-- First `datasets` entry `name`: label used in benchmark outputs.
-- First `datasets` entry `type: performance`: multi-turn datasets are replayed as
-  performance datasets.
-- First `datasets` entry `path`: JSONL dataset path to run.
-- First `datasets` entry `multi_turn.turn_timeout_s`: per-turn deadline. A
-  timeout aborts the remaining turns in that conversation.
-- First `datasets` entry `multi_turn.enable_salt`: appends a deterministic cache
-  salt to each conversation system prompt.
-- First `datasets` entry `multi_turn.inject_tool_delay`: honors positive
-  `delay_seconds` values from client turns before issuing those turns.
-- `settings.runtime.min_duration_ms`: minimum run duration. With no max duration
-  override, the run finishes when the dataset is exhausted.
-- `settings.load_pattern.type: multi_turn`: enables conversation-aware issuing.
-- `settings.load_pattern.target_concurrency`: maximum active conversations.
-  Each active conversation has at most one in-flight request.
-- `settings.client.warmup_connections: 0`: avoids stale pre-warmed sockets with
-  servers that close idle connections quickly.
-- `settings.client.max_idle_time`: connection idle lifetime.
-- `endpoint_config.endpoints`: server URL list.
-- `endpoint_config.api_type: openai`: use `/v1/chat/completions`.
-- `report_dir`: output directory for events, snapshots, and reports.
+### Fields
+
+- `name`: human-readable run name written to reports and logs. Change this when
+  creating a distinct benchmark config.
+- `version`: config version label for this example.
+- `type`: scheduler mode for the run.
+- `model_params.name`: model name sent in each OpenAI request. Set this to the
+  model name served by the endpoint.
+- `model_params.temperature`: sampling temperature sent to the server.
+- `model_params.top_p`: nucleus sampling value sent to the server.
+- `model_params.max_new_tokens`: per-turn generation cap.
+- `model_params.chat_template_kwargs.thinking`: Kimi chat-template option.
+- `model_params.chat_template_kwargs.preserve_thinking`: preserves
+  reasoning content in the rendered prompt.
+- First dataset `name`: label used in benchmark outputs. Change this to match
+  the dataset variant being run.
+- First dataset `type`: dataset role for this entry.
+- First dataset `path`: JSONL dataset path to run. Set this to a real local or
+  mounted dataset path, for example `/path/to/agentic_combined.jsonl`.
+- First dataset `accuracy_config.eval_method`: scorer used during finalization.
+  `multi_turn_inline` scores the performance replay outputs without issuing a
+  separate accuracy phase.
+- First dataset `multi_turn.enable_salt`: applies deterministic salt
+  markers when issuing conversation instances so repeats do not reuse KV cache
+  by accident.
+- First dataset `multi_turn.inject_tool_delay`: honors positive
+  `delay_seconds` values from the dataset before issuing user/tool turns.
+- First dataset `multi_turn.num_trajectories_to_issue`: total number of
+  trajectories to start. Change this to scale runtime.
+- First dataset `multi_turn.stop_issuing_on_first_user_complete`: controls only
+  whether the client keeps issuing after the measurement window ends. Performance
+  tracking always stops when the first concurrency slot finishes a trajectory and
+  there is no next trajectory left to assign. If this field is `true`, the client
+  stops issuing future turns at that point and drains already in-flight turns. If
+  this field is `false`, the client keeps replaying already-started active
+  trajectories to completion for accuracy/log coverage, but those later-issued
+  turns are outside the performance measurement window.
+- `settings.runtime.min_duration_ms`: minimum run duration. Multi-turn replay
+  completion is controlled by trajectory budget and active conversation drain.
+- `settings.load_pattern.type`: load pattern for the run; `multi_turn` enables
+  conversation-aware issuing.
+- `settings.load_pattern.target_concurrency`: maximum active conversations. Each
+  active conversation has at most one in-flight request. Change this for the
+  target concurrency of the run.
+- `settings.client.warmup_connections`: controls pre-warmed HTTP sockets; `0`
+  disables them.
+- `settings.client.max_idle_time`: connection idle lifetime in seconds.
+- `endpoint_config.endpoints`: server URL list. Replace with the endpoint URLs
+  for the run.
+- `endpoint_config.api_type`: selects the endpoint protocol and route.
+- `report_dir`: output directory for events, snapshots, scores, and reports.
+  Change this per run so outputs are not overwritten.
+
+### Benchmark Invariants
+
+For official Kimi agentic benchmark runs, keep these values fixed:
+
+- `version: "1.0"`
+- `type: "online"`
+- `model_params.temperature: 1.0`
+- `model_params.top_p: 0.95`
+- `model_params.max_new_tokens: 8192`
+- `model_params.chat_template_kwargs.thinking: true`
+- `model_params.chat_template_kwargs.preserve_thinking: true`
+- First dataset `type: performance`
+- First dataset `accuracy_config.eval_method: multi_turn_inline`
+- `settings.runtime.min_duration_ms: 0`
+- `settings.load_pattern.type: multi_turn`
+- `settings.client.warmup_connections: 0`
+- `settings.client.max_idle_time: 0.5`
+- `endpoint_config.api_type: openai`
+
+The required multi-turn dataset defaults are:
+
+- First dataset `multi_turn.enable_salt: true`
+- First dataset `multi_turn.inject_tool_delay: true`
+- First dataset `multi_turn.stop_issuing_on_first_user_complete: false`
+
+Set `multi_turn.num_trajectories_to_issue` to an integer multiple of the
+dataset trajectory count so that every trajectory is issued the same number of
+times. Use `multi_turn.stop_issuing_on_first_user_complete: true` only for
+faster optimization/debug runs, not for official benchmark runs.
+
+### Salting Mechanism
+
+When `multi_turn.enable_salt: true`, the strategy adds two short deterministic
+`[salt: ...]` markers: one before the system prompt, keyed to the trajectory
+repeat, and one after the system prompt, keyed to the conversation. Each salt is
+four hex characters. This restricts KV-cache reuse as follows:
+
+1. Within a trajectory: fully allowed.
+2. Across trajectories in the same iteration of the dataset: allowed for the
+   system prompt only.
+3. Across iterations of the dataset: disallowed.
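The salt layout described in this section can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names (`short_salt`, `salt_system_prompt`), the SHA-256 derivation, and the exact spacing around the markers are all assumptions, since the diff does not show how the salts are computed.

```python
import hashlib


def short_salt(*parts: object) -> str:
    """Return a deterministic four-hex-character salt for the given parts."""
    key = ":".join(str(p) for p in parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:4]


def salt_system_prompt(
    system_prompt: str, repeat_index: int, conversation_id: str
) -> str:
    """Wrap a system prompt with two salt markers (hypothetical layout).

    The marker before the prompt depends only on the dataset repeat, so all
    conversations in one repeat share a prefix through the system prompt.
    The marker after the prompt depends on the conversation, so anything
    past the system prompt is shared only within that conversation.
    """
    repeat_salt = short_salt("repeat", repeat_index)
    conversation_salt = short_salt("conversation", repeat_index, conversation_id)
    return f"[salt: {repeat_salt}] {system_prompt} [salt: {conversation_salt}]"


if __name__ == "__main__":
    print(salt_system_prompt("You are a coding agent.", 0, "conv-1"))
    print(salt_system_prompt("You are a coding agent.", 0, "conv-2"))
```

Because the leading marker changes with the repeat index, a second pass over the dataset normally diverges from the first at the very start of the prompt. With only four hex characters there are 65,536 possible salts, so an occasional collision between repeats is possible in a scheme like this one.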
+
+### Inline Accuracy
+
+When `accuracy_config.eval_method: multi_turn_inline` is set on the performance
+dataset, the benchmark scores the generated `events.jsonl` during finalization
+and writes `scores.json` under `report_dir`. The scorer uses the loaded
+multi-turn dataset as ground truth, matches completed assistant responses back
+to their conversation/turn ids, and compares them with the expected assistant
+turns embedded in the dataset. It does not issue a separate accuracy phase.
+
+### Tail Management
+
+Multi-turn benchmarks can have a long tail because different users receive
+trajectories with very different turn counts, delays, and generated lengths. In
+large runs this tail can last up to an hour after steady-state work has already
+ended, so the benchmark separates the performance window from the remaining
+accuracy/logging drain.
+
+The benchmark stops performance tracking when the first active user finishes its
+final assigned trajectory. It emits `STOP_PERFORMANCE_TRACKING` at that point to
+avoid measuring the tail. Turns issued before this event remain in the
+performance window even if they finish later; turns issued after it are excluded
+from performance metrics.
+
+For final submissions, keep
+`multi_turn.stop_issuing_on_first_user_complete: false` so the client finishes
+already-started trajectories for accuracy. During optimization, set it to `true`
+to stop issuing future turns at the performance boundary and shorten the tail.
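The window rule above depends on when a turn was issued, not when it completed. A minimal sketch, assuming a hypothetical `Turn` record with made-up timestamps and treating a turn issued exactly at the boundary as excluded (the README does not specify that edge case):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """Hypothetical per-turn record; field names are illustrative only."""

    turn_id: str
    issued_at: float  # seconds since run start
    completed_at: float  # seconds since run start


def in_performance_window(turn: Turn, stop_tracking_at: float) -> bool:
    """A turn counts if it was issued before STOP_PERFORMANCE_TRACKING."""
    return turn.issued_at < stop_tracking_at


turns = [
    Turn("a", issued_at=10.0, completed_at=12.0),
    Turn("b", issued_at=19.0, completed_at=25.0),  # completes after the boundary
    Turn("c", issued_at=21.0, completed_at=23.0),  # issued after the boundary
]
counted = [t.turn_id for t in turns if in_performance_window(t, stop_tracking_at=20.0)]
print(counted)  # ['a', 'b']
```

Turn `b` stays in the window even though it completes after the boundary, while turn `c` is excluded despite completing sooner.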
 
 ## Run The Client
 

@@ -1,44 +1,44 @@
 name: "kimi-agentic-benchmark"
 version: "1.0"
-type: "online"
+type: "online" # do not change.
 
 model_params:
   name: "/model"
-  temperature: 1.0
-  top_p: 0.95
-  max_new_tokens: 20000 # covers longest observed assistant turn (~18k tokens)
+  temperature: 1.0 # do not change.
+  top_p: 0.95 # do not change.
+  max_new_tokens: 8192 # do not change.
   chat_template_kwargs:
-    thinking: true
-    preserve_thinking: true
+    thinking: true # do not change.
+    preserve_thinking: true # do not change.
 
 datasets:
-  # Select the dataset to run by updating both `name` and `path`.
-  # Use agentic_coding for coding traces or agentic_workflow for workflow traces.
-  - name: agentic_coding
-    type: performance
-    path: /path/to/agentic_dataset.jsonl
+  - name: agentic_combined
+    type: performance # do not change.
+    path: /path/to/agentic_combined.jsonl
+    accuracy_config:
+      eval_method: multi_turn_inline # required benchmark default.
     multi_turn:
-      turn_timeout_s: 600.0
-      enable_salt: true # add salt after system prompt to prevent cache reuse across trajectories
-      inject_tool_delay: true # add delay before user/tool turns
+      enable_salt: true # required benchmark default.
+      inject_tool_delay: true # required benchmark default.
+      num_trajectories_to_issue: 990 # Should be an integer multiple of 990.
+      # Required benchmark default; set to true only for faster optimization/debug runs.
+      stop_issuing_on_first_user_complete: false
 
 settings:
   runtime:
     min_duration_ms: 0
 
   load_pattern:
     type: multi_turn
-    target_concurrency: 8
+    target_concurrency: 8 # Submission-specific concurrency.
 
-  # Mandatory: with the default warmup behaviour, every request fails with
-  # ConnectionResetError because uvicorn closes pre-warmed idle sockets after 5s.
   client:
     warmup_connections: 0
     max_idle_time: 0.5
 
 endpoint_config:
   endpoints:
     - "http://localhost:8000"
-  api_type: openai
+  api_type: openai # do not change.
 
 report_dir: logs/kimi_agentic
@@ -121,7 +121,7 @@ def _get_thread_tokenizer(self) -> PreTrainedTokenizerBase:
         """Return the tokenizer for the current thread, loading it if needed."""
         if getattr(self._thread_local, "tokenizer", None) is None:
             self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
-                self._tokenizer_name
+                self._tokenizer_name, trust_remote_code=True
             )
             # Baseline = tokens contributed by a [user, empty-assistant] pair minus
             # the [user] prefix alone. Some templates (Qwen3-Coder, etc.) reject

@@ -228,6 +228,35 @@ def _check_tokenizer_exists(model_name: str) -> bool:
         return False
 
 
+def _resolve_accuracy_components(
+    dataset_name: str, accuracy_config: Any | None
+) -> tuple[type[Scorer], type[Extractor] | None]:
+    """Validate scorer/extractor config and return resolved classes."""
+    if accuracy_config is None or accuracy_config.eval_method is None:
+        raise InputValidationError(
+            f"Dataset '{dataset_name}' requires accuracy_config with eval_method"
+        )
+
+    try:
+        scorer_cls = Scorer.get(accuracy_config.eval_method)
+    except KeyError as exc:
+        raise InputValidationError(str(exc)) from exc
+    extractor_name = accuracy_config.extractor
+    if extractor_name is None:
+        if scorer_cls.REQUIRES_EXTRACTOR:
+            raise InputValidationError(
+                f"Dataset '{dataset_name}' uses scorer "
+                f"'{accuracy_config.eval_method}' which requires an extractor"
+            )
+        extractor_cls: type[Extractor] | None = None
+    else:
+        try:
+            extractor_cls = Extractor.get(extractor_name)
+        except KeyError as exc:
+            raise InputValidationError(str(exc)) from exc
+    return scorer_cls, extractor_cls
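The validation flow of this helper can be tried in isolation with stand-in registries. Everything in the sketch below is hypothetical scaffolding rather than project code: `SCORERS`, `EXTRACTORS`, `AccuracyConfig`, the `needs_extractor` scorer name, and the error messages are invented for the example, and plain dict lookups replace `Scorer.get` and `Extractor.get`.

```python
from dataclasses import dataclass
from typing import Optional


class InputValidationError(ValueError):
    """Stand-in for the project's validation error type."""


class InlineScorer:
    REQUIRES_EXTRACTOR = False


class ExtractingScorer:
    REQUIRES_EXTRACTOR = True


# Hypothetical registries; the real code resolves names via Scorer/Extractor.
SCORERS = {"multi_turn_inline": InlineScorer, "needs_extractor": ExtractingScorer}
EXTRACTORS: dict[str, type] = {}


@dataclass
class AccuracyConfig:
    eval_method: Optional[str] = None
    extractor: Optional[str] = None


def resolve(dataset_name: str, cfg: Optional[AccuracyConfig]):
    """Return (scorer_cls, extractor_cls or None) or raise InputValidationError."""
    if cfg is None or cfg.eval_method is None:
        raise InputValidationError(
            f"Dataset '{dataset_name}' requires accuracy_config with eval_method"
        )
    try:
        scorer_cls = SCORERS[cfg.eval_method]
    except KeyError as exc:
        # Unknown names surface as validation errors, not raw KeyErrors.
        raise InputValidationError(f"Unknown eval_method: {cfg.eval_method}") from exc
    if cfg.extractor is None:
        if scorer_cls.REQUIRES_EXTRACTOR:
            raise InputValidationError(
                f"Dataset '{dataset_name}' uses scorer '{cfg.eval_method}' "
                "which requires an extractor"
            )
        return scorer_cls, None
    try:
        return scorer_cls, EXTRACTORS[cfg.extractor]
    except KeyError as exc:
        raise InputValidationError(f"Unknown extractor: {cfg.extractor}") from exc


def describe(cfg: Optional[AccuracyConfig]) -> str:
    """Summarize the outcome of resolving one config."""
    try:
        scorer_cls, _ = resolve("agentic_combined", cfg)
        return f"ok: {scorer_cls.__name__}"
    except InputValidationError as exc:
        return f"rejected: {exc}"


for candidate in (
    AccuracyConfig("multi_turn_inline"),
    AccuracyConfig("no_such_scorer"),
    AccuracyConfig("needs_extractor"),
    None,
):
    print(describe(candidate))
```

Only the first candidate resolves; the other three are rejected with a validation error, which is the behavior the shared helper gives both the accuracy datasets and the inline-scored performance dataset.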
+
+
 def _load_datasets(
     config: BenchmarkConfig, report_dir: Path
 ) -> tuple[Dataset, list[Dataset], list[AccuracyConfiguration]]:
@@ -247,25 +276,10 @@ def _load_datasets(
 
     # Pack the evaluation parameters for each accuracy dataset
     for acc_cfg in accuracy_cfgs:
-        if (
-            acc_cfg.accuracy_config is None
-            or acc_cfg.accuracy_config.eval_method is None
-        ):
-            raise InputValidationError(
-                f"Dataset '{acc_cfg.name}' requires accuracy_config with eval_method"
-            )
-
-        scorer_cls = Scorer.get(acc_cfg.accuracy_config.eval_method)
-        extractor_name = acc_cfg.accuracy_config.extractor
-        if extractor_name is None:
-            if scorer_cls.REQUIRES_EXTRACTOR:
-                raise InputValidationError(
-                    f"Dataset '{acc_cfg.name}' uses scorer "
-                    f"'{acc_cfg.accuracy_config.eval_method}' which requires an extractor"
-                )
-            extractor_cls: type[Extractor] | None = None
-        else:
-            extractor_cls = Extractor.get(extractor_name)
+        scorer_cls, extractor_cls = _resolve_accuracy_components(
+            acc_cfg.name, acc_cfg.accuracy_config
+        )
+        assert acc_cfg.accuracy_config is not None
 
         ds = DataLoaderFactory.create_loader(
             acc_cfg, num_repeats=acc_cfg.accuracy_config.num_repeats
@@ -290,12 +304,13 @@ def _load_datasets(
         logger.info(f"Loaded {ds} - {ds.num_samples()} samples")
 
     if not accuracy_cfgs:
-        logger.info("No accuracy datasets provided")
+        logger.info("No separate accuracy datasets provided")
     if len(performance_cfgs) > 1:
         raise InputValidationError("Multiple performance datasets not supported")
 
+    perf_cfg = performance_cfgs[0]
     try:
-        dataloader = DataLoaderFactory.create_loader(performance_cfgs[0])
+        dataloader = DataLoaderFactory.create_loader(perf_cfg)
         dataloader.load(
             api_type=config.endpoint_config.api_type, model_params=config.model_params
         )
@@ -307,6 +322,31 @@ def _load_datasets(
     except Exception as e:
         raise SetupError(f"Failed to load dataset: {e}") from e
 
+    if perf_cfg.accuracy_config is not None:
+        accuracy_config = perf_cfg.accuracy_config
+        if accuracy_config.num_repeats != 1:
+            raise InputValidationError(
+                f"Dataset '{perf_cfg.name}' is a performance dataset; "
+                "accuracy_config.num_repeats must be 1 because scoring runs on "
+                "already-issued performance outputs"
+            )
+        scorer_cls, extractor_cls = _resolve_accuracy_components(
+            perf_cfg.name, accuracy_config
+        )
+
+        eval_configs.append(
+            AccuracyConfiguration(
+                scorer_cls,
+                extractor_cls,
+                "performance",
+                dataloader,
+                report_dir,
+                accuracy_config.ground_truth,
+                accuracy_config.num_repeats,
+                accuracy_config.extras or {},
+            )
+        )
+
     return dataloader, accuracy_datasets, eval_configs
 
 
@@ -433,6 +473,8 @@ def _build_phases(
     # Accuracy phases — use eval_cfg.dataset_name as phase name so it matches
     # what Scorer._load_sample_index_map() looks up in sample_idx_map.json
     for eval_cfg in ctx.eval_configs:
+        if eval_cfg.dataset_name == "performance":
+            continue
         acc_ds = eval_cfg.dataset
         if isinstance(acc_ds, MultiTurnDataset):
             raise InputValidationError(
@@ -859,15 +901,17 @@ def finalize_benchmark(ctx: BenchmarkContext, bench: BenchmarkResult) -> None:
         )
         score, n_repeats = scorer_instance.score()
         assert eval_cfg.dataset.data is not None
+        num_samples = len(eval_cfg.dataset.data)
+        if eval_cfg.dataset_name == "performance":
+            num_samples = sum(phase.issued_count for phase in result.perf_results)
         accuracy_scores[eval_cfg.dataset_name] = {
             "dataset_name": eval_cfg.dataset_name,
-            "num_samples": len(eval_cfg.dataset.data),
+            "num_samples": num_samples,
             "extractor": (
                 eval_cfg.extractor.__name__ if eval_cfg.extractor is not None else None
             ),
             "ground_truth_column": eval_cfg.ground_truth_column,
             "score": score,
-            "n_repeats": n_repeats,
         }
         logger.info(f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)")