149 changes: 119 additions & 30 deletions examples/09_MultiTurn/README.md
@@ -51,36 +51,125 @@ clients; set `model_params.name` in the YAML to the same value.
The runnable config is
`examples/09_MultiTurn/kimi_agentic_benchmark.yaml`.

Key fields:

- `type: online`: runs the benchmark through the online scheduler.
- `model_params.name`: model name sent in each OpenAI request. Keep it aligned
with the served model name.
- `model_params.temperature`, `top_p`, `max_new_tokens`: sampling settings sent
to the server. `max_new_tokens` is large because agent turns can be long.
- `model_params.chat_template_kwargs`: Kimi-specific template options for
reasoning preservation.
- First `datasets` entry `name`: label used in benchmark outputs.
- First `datasets` entry `type: performance`: multi-turn datasets are replayed as
performance datasets.
- First `datasets` entry `path`: JSONL dataset path to run.
- First `datasets` entry `multi_turn.turn_timeout_s`: per-turn deadline. A
timeout aborts the remaining turns in that conversation.
- First `datasets` entry `multi_turn.enable_salt`: appends a deterministic cache
salt to each conversation system prompt.
- First `datasets` entry `multi_turn.inject_tool_delay`: honors positive
`delay_seconds` values from client turns before issuing those turns.
- `settings.runtime.min_duration_ms`: minimum run duration. With no max duration
override, the run finishes when the dataset is exhausted.
- `settings.load_pattern.type: multi_turn`: enables conversation-aware issuing.
- `settings.load_pattern.target_concurrency`: maximum active conversations.
Each active conversation has at most one in-flight request.
- `settings.client.warmup_connections: 0`: avoids stale pre-warmed sockets with
servers that close idle connections quickly.
- `settings.client.max_idle_time`: connection idle lifetime.
- `endpoint_config.endpoints`: server URL list.
- `endpoint_config.api_type: openai`: use `/v1/chat/completions`.
- `report_dir`: output directory for events, snapshots, and reports.
### Fields

- `name`: human-readable run name written to reports and logs. Change this when
creating a distinct benchmark config.
- `version`: config version label for this example.
- `type`: scheduler mode for the run.
- `model_params.name`: model name sent in each OpenAI request. Set this to the
model name served by the endpoint.
- `model_params.temperature`: sampling temperature sent to the server.
- `model_params.top_p`: nucleus sampling value sent to the server.
- `model_params.max_new_tokens`: per-turn generation cap.
- `model_params.chat_template_kwargs.thinking`: Kimi chat-template option.
- `model_params.chat_template_kwargs.preserve_thinking`: preserves
reasoning content in the rendered prompt.
- First dataset `name`: label used in benchmark outputs. Change this to match
the dataset variant being run.
- First dataset `type`: dataset role for this entry.
- First dataset `path`: JSONL dataset path to run. Set this to a real local or
mounted dataset path, for example `/path/to/agentic_combined.jsonl`.
- First dataset `accuracy_config.eval_method`: scorer used during finalization.
`multi_turn_inline` scores the performance replay outputs without issuing a
separate accuracy phase.
- First dataset `multi_turn.enable_salt`: applies deterministic salt
markers when issuing conversation instances so repeats do not reuse KV cache
by accident.
- First dataset `multi_turn.inject_tool_delay`: honors positive
`delay_seconds` values from the dataset before issuing user/tool turns.
- First dataset `multi_turn.num_trajectories_to_issue`: total number of
trajectories to start. Change this to scale runtime.
- First dataset `multi_turn.stop_issuing_on_first_user_complete`: controls only
whether the client keeps issuing after the measurement window ends. Performance
tracking always stops when the first concurrency slot finishes a trajectory and
there is no next trajectory left to assign. If this field is `true`, the client
stops issuing future turns at that point and drains already in-flight turns. If
this field is `false`, the client keeps replaying already-started active
trajectories to completion for accuracy/log coverage, but those later-issued
turns are outside the performance measurement window.
- `settings.runtime.min_duration_ms`: minimum run duration. Multi-turn replay
completion is controlled by trajectory budget and active conversation drain.
- `settings.load_pattern.type`: load pattern for the run. `multi_turn` enables
conversation-aware issuing.
- `settings.load_pattern.target_concurrency`: maximum active conversations. Each
active conversation has at most one in-flight request. Change this for the
target concurrency of the run.
- `settings.client.warmup_connections`: controls pre-warmed HTTP sockets. `0`
disables them.
- `settings.client.max_idle_time`: connection idle lifetime in seconds.
- `endpoint_config.endpoints`: server URL list. Replace with the endpoint URLs
for the run.
- `endpoint_config.api_type`: selects the endpoint protocol and route.
- `report_dir`: output directory for events, snapshots, scores, and reports.
Change this per run so outputs are not overwritten.

### Benchmark Invariants

For official Kimi agentic benchmark runs, keep these values fixed:

- `version: "1.0"`
- `type: "online"`
- `model_params.temperature: 1.0`
- `model_params.top_p: 0.95`
- `model_params.max_new_tokens: 8192`
- `model_params.chat_template_kwargs.thinking: true`
- `model_params.chat_template_kwargs.preserve_thinking: true`
- First dataset `type: performance`
- First dataset `accuracy_config.eval_method: multi_turn_inline`
- `settings.runtime.min_duration_ms: 0`
- `settings.load_pattern.type: multi_turn`
- `settings.client.warmup_connections: 0`
- `settings.client.max_idle_time: 0.5`
- `endpoint_config.api_type: openai`

The required multi-turn dataset defaults are:

- First dataset `multi_turn.enable_salt: true`
- First dataset `multi_turn.inject_tool_delay: true`
- First dataset `multi_turn.stop_issuing_on_first_user_complete: false`

Set `multi_turn.num_trajectories_to_issue` to an integer multiple of the
dataset trajectory count so each repeat has the same representation. Use
`multi_turn.stop_issuing_on_first_user_complete: true` only for faster
optimization/debug runs, not official benchmark runs.
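
As a pre-flight check, the multiple-of-dataset-size rule can be verified before
launching a run. The following is a minimal sketch, not part of the benchmark:
`check_trajectory_budget` is a hypothetical helper, and the trajectory count of
990 is taken from the comment in the example YAML.

```python
def check_trajectory_budget(num_trajectories_to_issue: int, dataset_trajectories: int) -> int:
    """Return the number of full dataset repeats, or raise if the budget is uneven.

    Hypothetical helper for illustration; the benchmark itself may validate
    this differently or not at all.
    """
    if dataset_trajectories <= 0:
        raise ValueError("dataset_trajectories must be positive")
    if num_trajectories_to_issue <= 0:
        raise ValueError("num_trajectories_to_issue must be positive")
    repeats, remainder = divmod(num_trajectories_to_issue, dataset_trajectories)
    if remainder:
        raise ValueError(
            f"{num_trajectories_to_issue} is not an integer multiple of "
            f"{dataset_trajectories}; the last repeat would be partial"
        )
    return repeats


# 1980 trajectories over a 990-trajectory dataset is exactly two repeats.
repeats = check_trajectory_budget(1980, 990)
```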

### Salting Mechanism

When `multi_turn.enable_salt: true`, the strategy adds two short deterministic
`[salt: ...]` markers: one before the system prompt, keyed to the trajectory
repeat, and one after the system prompt, keyed to the conversation. Each salt is
four hex characters. As a result, KV-cache reuse is:

1. Fully allowed within a trajectory.
2. Allowed for the system prompt within the same iteration of the dataset.
3. Disallowed across iterations of the dataset.
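
The placement of the two markers can be sketched as follows. This is an
illustration under stated assumptions, not the benchmark's implementation: the
SHA-256 derivation, the key strings, the newline separators, and the names
`short_salt` and `salted_system_prompt` are all hypothetical. Only the
`[salt: ...]` marker shape, the four-hex-character length, and the
before/after placement come from the description above.

```python
import hashlib


def short_salt(*parts: object) -> str:
    """Derive four hex characters deterministically from the given parts (assumed scheme)."""
    key = "|".join(str(p) for p in parts).encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:4]


def salted_system_prompt(system_prompt: str, repeat_index: int, conversation_id: str) -> str:
    """Wrap a system prompt with a per-repeat salt before it and a per-conversation salt after it."""
    repeat_salt = short_salt("repeat", repeat_index)
    conversation_salt = short_salt("conversation", repeat_index, conversation_id)
    return f"[salt: {repeat_salt}]\n{system_prompt}\n[salt: {conversation_salt}]"


# Two conversations in the same repeat share the leading salt and the system
# prompt, so that prefix stays cacheable; the trailing salt then separates them.
a = salted_system_prompt("You are a coding agent.", 0, "conv-a")
b = salted_system_prompt("You are a coding agent.", 0, "conv-b")
```

Because the leading marker changes with the repeat index, a later iteration of
the dataset starts from a different first token sequence and cannot reuse the
prefix cached by an earlier iteration.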

### Inline Accuracy

When `accuracy_config.eval_method: multi_turn_inline` is set on the performance
dataset, the benchmark scores the generated `events.jsonl` during finalization
and writes `scores.json` under `report_dir`. The scorer uses the loaded
multi-turn dataset as ground truth, matches completed assistant responses back
to their conversation/turn ids, and compares them with the expected assistant
turns embedded in the dataset. It does not issue a separate accuracy phase.
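
The matching step can be pictured with a small sketch. The event field names
(`conversation_id`, `turn`, `response`), the exact-match comparison, and the
function `score_inline` are assumptions for illustration only; the real
`events.jsonl` schema and the `multi_turn_inline` scoring rule are not shown
here and may differ.

```python
import json
from typing import Iterable


def score_inline(event_lines: Iterable[str], expected: dict[tuple[str, int], str]) -> float:
    """Score JSONL events against expected assistant turns keyed by (conversation id, turn).

    Hypothetical schema: each event is a JSON object with `conversation_id`,
    `turn`, and `response`. Events with no expected turn are ignored.
    """
    matched = 0
    correct = 0
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        key = (event["conversation_id"], event["turn"])
        if key not in expected:
            continue
        matched += 1
        # Exact string match is an assumption; a real scorer may be more lenient.
        if event["response"].strip() == expected[key].strip():
            correct += 1
    return correct / matched if matched else 0.0
```

Reading the events that the performance replay already produced is what lets
this style of scoring avoid a separate accuracy phase.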

### Tail Management

Multi-turn benchmarks can have a long tail because different users receive
trajectories with very different turn counts, delays, and generated lengths. In
large runs this tail can last up to an hour after steady-state work has already
ended, so the benchmark separates the performance window from the remaining
accuracy/logging drain.

The benchmark stops performance tracking when the first active user (concurrency
slot) finishes its final assigned trajectory and no further trajectory is left
to assign. It emits `STOP_PERFORMANCE_TRACKING` at that point to avoid measuring
the tail. Turns issued before this event remain in the performance window even
if they finish later; turns issued after it are excluded from performance
metrics.

For final submissions, keep
`multi_turn.stop_issuing_on_first_user_complete: false` so the client finishes
already-started trajectories for accuracy. During optimization, set it to `true`
to stop issuing future turns at the performance boundary and shorten the tail.
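
The window rule depends on when a turn was issued, not when it completed. A
minimal sketch of that partition follows; the `Turn` record, its field names,
and `split_by_performance_window` are hypothetical, and treating a turn issued
exactly at the boundary as outside the window is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    conversation_id: str
    issued_at: float      # seconds since run start (hypothetical field)
    completed_at: float   # seconds since run start (hypothetical field)


def split_by_performance_window(
    turns: list[Turn], stop_tracking_at: float
) -> tuple[list[Turn], list[Turn]]:
    """Partition turns into (in-window, tail) by issue time relative to the stop event."""
    in_window = [t for t in turns if t.issued_at < stop_tracking_at]
    tail = [t for t in turns if t.issued_at >= stop_tracking_at]
    return in_window, tail


turns = [
    Turn("a", issued_at=1.0, completed_at=5.0),
    Turn("b", issued_at=2.0, completed_at=12.0),   # finishes late, still counted
    Turn("c", issued_at=11.0, completed_at=13.0),  # issued after the stop event
]
in_window, tail = split_by_performance_window(turns, stop_tracking_at=10.0)
```

With `stop_issuing_on_first_user_complete: true`, the tail list would stay
empty because no turns are issued after the boundary.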

## Run The Client

36 changes: 18 additions & 18 deletions examples/09_MultiTurn/kimi_agentic_benchmark.yaml
@@ -1,44 +1,44 @@
name: "kimi-agentic-benchmark"
version: "1.0"
type: "online"
type: "online" # do not change.

model_params:
name: "/model"
temperature: 1.0
top_p: 0.95
max_new_tokens: 20000 # covers longest observed assistant turn (~18k tokens)
temperature: 1.0 # do not change.
top_p: 0.95 # do not change.
max_new_tokens: 8192 # do not change.
chat_template_kwargs:
thinking: true
preserve_thinking: true
thinking: true # do not change.
preserve_thinking: true # do not change.

datasets:
# Select the dataset to run by updating both `name` and `path`.
# Use agentic_coding for coding traces or agentic_workflow for workflow traces.
- name: agentic_coding
type: performance
path: /path/to/agentic_dataset.jsonl
- name: agentic_combined
type: performance # do not change.
path: /path/to/agentic_combined.jsonl
accuracy_config:
eval_method: multi_turn_inline # required benchmark default.
multi_turn:
turn_timeout_s: 600.0
enable_salt: true # add salt after system prompt to prevent cache reuse across trajectories
inject_tool_delay: true # add delay before user/tool turns
enable_salt: true # required benchmark default.
inject_tool_delay: true # required benchmark default.
num_trajectories_to_issue: 990 # Should be integer multiple of 990.
# Required benchmark default; set to true only for faster optimization/debug runs.
stop_issuing_on_first_user_complete: false

settings:
runtime:
min_duration_ms: 0

load_pattern:
type: multi_turn
target_concurrency: 8
target_concurrency: 8 # Submission-specific concurrency.

# Mandatory: with the default warmup behaviour, every request fails with
# ConnectionResetError because uvicorn closes pre-warmed idle sockets after 5s.
client:
warmup_connections: 0
max_idle_time: 0.5

endpoint_config:
endpoints:
- "http://localhost:8000"
api_type: openai
api_type: openai # do not change.

report_dir: logs/kimi_agentic
@@ -121,7 +121,7 @@ def _get_thread_tokenizer(self) -> PreTrainedTokenizerBase:
"""Return the tokenizer for the current thread, loading it if needed."""
if getattr(self._thread_local, "tokenizer", None) is None:
self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
self._tokenizer_name
self._tokenizer_name, trust_remote_code=True
)
# Baseline = tokens contributed by a [user, empty-assistant] pair minus
# the [user] prefix alone. Some templates (Qwen3-Coder, etc.) reject
90 changes: 67 additions & 23 deletions src/inference_endpoint/commands/benchmark/execute.py
@@ -228,6 +228,35 @@ def _check_tokenizer_exists(model_name: str) -> bool:
return False


def _resolve_accuracy_components(
    dataset_name: str, accuracy_config: Any | None
) -> tuple[type[Scorer], type[Extractor] | None]:
    """Validate scorer/extractor config and return resolved classes."""
    if accuracy_config is None or accuracy_config.eval_method is None:
        raise InputValidationError(
            f"Dataset '{dataset_name}' requires accuracy_config with eval_method"
        )

    try:
        scorer_cls = Scorer.get(accuracy_config.eval_method)
    except KeyError as exc:
        raise InputValidationError(str(exc)) from exc
    extractor_name = accuracy_config.extractor
    if extractor_name is None:
        if scorer_cls.REQUIRES_EXTRACTOR:
            raise InputValidationError(
                f"Dataset '{dataset_name}' uses scorer "
                f"'{accuracy_config.eval_method}' which requires an extractor"
            )
        extractor_cls: type[Extractor] | None = None
    else:
        try:
            extractor_cls = Extractor.get(extractor_name)
        except KeyError as exc:
            raise InputValidationError(str(exc)) from exc
    return scorer_cls, extractor_cls


def _load_datasets(
config: BenchmarkConfig, report_dir: Path
) -> tuple[Dataset, list[Dataset], list[AccuracyConfiguration]]:
@@ -247,25 +276,10 @@ def _load_datasets(

# Pack the evaluation parameters for each accuracy dataset
for acc_cfg in accuracy_cfgs:
if (
acc_cfg.accuracy_config is None
or acc_cfg.accuracy_config.eval_method is None
):
raise InputValidationError(
f"Dataset '{acc_cfg.name}' requires accuracy_config with eval_method"
)

scorer_cls = Scorer.get(acc_cfg.accuracy_config.eval_method)
extractor_name = acc_cfg.accuracy_config.extractor
if extractor_name is None:
if scorer_cls.REQUIRES_EXTRACTOR:
raise InputValidationError(
f"Dataset '{acc_cfg.name}' uses scorer "
f"'{acc_cfg.accuracy_config.eval_method}' which requires an extractor"
)
extractor_cls: type[Extractor] | None = None
else:
extractor_cls = Extractor.get(extractor_name)
scorer_cls, extractor_cls = _resolve_accuracy_components(
acc_cfg.name, acc_cfg.accuracy_config
)
assert acc_cfg.accuracy_config is not None

ds = DataLoaderFactory.create_loader(
acc_cfg, num_repeats=acc_cfg.accuracy_config.num_repeats
@@ -290,12 +304,13 @@ def _load_datasets(
logger.info(f"Loaded {ds} - {ds.num_samples()} samples")

if not accuracy_cfgs:
logger.info("No accuracy datasets provided")
logger.info("No separate accuracy datasets provided")
if len(performance_cfgs) > 1:
raise InputValidationError("Multiple performance datasets not supported")

perf_cfg = performance_cfgs[0]
try:
dataloader = DataLoaderFactory.create_loader(performance_cfgs[0])
dataloader = DataLoaderFactory.create_loader(perf_cfg)
dataloader.load(
api_type=config.endpoint_config.api_type, model_params=config.model_params
)
@@ -307,6 +322,31 @@ def _load_datasets(
except Exception as e:
raise SetupError(f"Failed to load dataset: {e}") from e

    if perf_cfg.accuracy_config is not None:
        accuracy_config = perf_cfg.accuracy_config
        if accuracy_config.num_repeats != 1:
            raise InputValidationError(
                f"Dataset '{perf_cfg.name}' is a performance dataset; "
                "accuracy_config.num_repeats must be 1 because scoring runs on "
                "already-issued performance outputs"
            )
        scorer_cls, extractor_cls = _resolve_accuracy_components(
            perf_cfg.name, accuracy_config
        )

        eval_configs.append(
            AccuracyConfiguration(
                scorer_cls,
                extractor_cls,
                "performance",
                dataloader,
                report_dir,
                accuracy_config.ground_truth,
                accuracy_config.num_repeats,
                accuracy_config.extras or {},
            )
        )

return dataloader, accuracy_datasets, eval_configs


@@ -433,6 +473,8 @@ def _build_phases(
# Accuracy phases — use eval_cfg.dataset_name as phase name so it matches
# what Scorer._load_sample_index_map() looks up in sample_idx_map.json
for eval_cfg in ctx.eval_configs:
if eval_cfg.dataset_name == "performance":
continue
acc_ds = eval_cfg.dataset
if isinstance(acc_ds, MultiTurnDataset):
raise InputValidationError(
@@ -859,15 +901,17 @@ def finalize_benchmark(ctx: BenchmarkContext, bench: BenchmarkResult) -> None:
)
score, n_repeats = scorer_instance.score()
assert eval_cfg.dataset.data is not None
num_samples = len(eval_cfg.dataset.data)
if eval_cfg.dataset_name == "performance":
num_samples = sum(phase.issued_count for phase in result.perf_results)
accuracy_scores[eval_cfg.dataset_name] = {
"dataset_name": eval_cfg.dataset_name,
"num_samples": len(eval_cfg.dataset.data),
"num_samples": num_samples,
"extractor": (
eval_cfg.extractor.__name__ if eval_cfg.extractor is not None else None
),
"ground_truth_column": eval_cfg.ground_truth_column,
"score": score,
"n_repeats": n_repeats,
}
logger.info(f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)")
