feat(bfcl): add BFCL v4 edge-agentic accuracy integration by Palanivelg · Pull Request #346 · mlcommons/endpoints

Palanivelg · 2026-06-08T17:06:02Z

What does this PR do?

Integrates the edge-agentic example (BFCL-v4 as the accuracy set): single-turn + multi-turn pipelines, datasets, adapters, and a reproducible run script. See examples/10_Edge_Agentic_Example/README.md.

Type of change

Bug fix
[x ] New feature
Documentation update
Refactor/cleanup

Related issues

Self-contained: this PR includes the numpy relaxation (numpy>=1.26.4) required by bfcl-eval (which hard-pins numpy==1.26.4), so it no longer depends on a separate prerequisite PR. The [bfcl] extra is isolated via [tool.uv].conflicts so its old pins don't constrain the shared tooling deps. This supersedes #345 (closed as redundant).

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-06-08T17:06:16Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request adds support for Berkeley Function Calling Leaderboard (BFCL) v4 accuracy benchmarking, introducing single-turn and multi-turn evaluation pipelines, datasets, and adapters, along with sequential sample ordering for deterministic evaluation. The code review feedback identifies three key improvements: moving the tool_calls parsing loop inside the try-except block in bfcl_v4_execution.py to prevent crashes on invalid JSON, conditionally including tools and tool_choice in the request payload in bfcl_v4_multi_turn_runner.py to avoid API errors when no tools are present, and guarding the n_repeats calculation in bfcl_v4_scorer.py to ensure it does not evaluate to zero if some samples fail.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Integrate Berkeley Function Calling Leaderboard (BFCL) v4 evaluation into the accuracy pipeline. Covers both single-turn function-calling subsets (non_live, live, hallucination) and the agentic multi-turn subsets (multi_turn_base + variants), validated against evalscope. Single-turn (drop-in scorer): - BFCLv4 predefined dataset (categories=[non_live|live|hallucination], configurable sample_pct) plus a default preset. - BFCLv4Scorer wires bfcl-eval's ast_checker into the standard accuracy phase via the existing scorer registry. - FunctionCallExtractor: normalizes native tool_calls JSON, JSON arrays of function-call objects, and text-form function calls into the canonical BFCL input format. - openai_msgspec_adapter now: * passes tool_choice through, * emits tool_calls verbatim as output_text for scoring, * coerces whole-number temperatures (vLLM strictness), * makes max_completion_tokens optional and uses a permissive ColumnFilter so messages+tools datasets pass through. Multi-turn (agentic loop, outside the standard scorer): - BFCLExecutionBridge: parses tool_calls, executes them locally against bfcl-eval's class instances, and constructs the tool response messages for the next turn. - BFCLMultiTurnRunner: drives the per-entry agentic conversation via httpx (bounded by max_steps_per_turn / timeout_s). - BFCLv4MultiTurnScorer: invokes bfcl-eval's multi_turn_checker. - bfcl_v4_multi_turn_cli: standalone CLI for the multi-turn flow. Supporting plumbing: - SequentialSampleOrder + `sequential=` flag on create_sample_order, used by accuracy phases so ordering matches reference runs. - BenchmarkConfig.Dataset.params: dict for dataset-specific kwargs (e.g. categories, sample_pct) plumbed through DataLoaderFactory. - ScorerMethod.BFCL_V4. - Dataset.load() preserves user-provided ColumnFilters when the adapter would otherwise inject a conflicting one. - `--accuracy-only` benchmark flag: skip the performance phase entirely (forces num_workers=1, max_connections=1 for deterministic per-sample ordering). Optional dep: `pip install -e ".[bfcl]"` (`bfcl-eval==2026.3.23`). The top-level numpy pin must be relaxed to `>=1.26.4` because bfcl-eval hard-pins `numpy==1.26.4` — shipped as a separate prerequisite commit on the chore/relax-numpy-pin branch. Validation (Qwen3.6-27B-Q4_K_M, temperature=0): - Single-turn live (10%): ~82% accuracy. - Multi-turn base (full 200): 140/200 = 70.00%, exact parity (100% match, 0 mismatches) with evalscope on identical inputs.

Add a committed config that reproduces the validated <3h BFCL v4 accuracy run on an embedded device (Thor). Dataset (BFCLv4.generate): - category_sample_pct: per-category sampling rates (e.g. non_live 20 / live 10 / hallucination 5), resolved per subset via SUBSET_TO_CATEGORY. - subset_floor: subsets whose TOTAL size is <= floor are taken in full, preventing tiny subsets (live_parallel=16, live_parallel_multiple=24) from collapsing to one or two noisy samples. Selection stays deterministic (head(n)). Plain sample_pct behavior is unchanged. Multi-turn CLI: - --sample-pct: deterministic per-subset sub-sampling so a 3% (~24 entry) multi-turn run is reproducible. Defaults to all entries. examples/10_BFCLv4_Example/: - offline_bfcl_v4_single_turn.yaml: single-turn (non_live + live + hallucination) accuracy config, run with --accuracy-only. - README.md: documents the two run paths (single-turn YAML vs multi-turn CLI), the per-category sampling + floor, and the ~2h49m Thor budget.

The example and docstring referenced a specific device; reword to the generic "edge device" since the <3h budget applies to embedded targets broadly.

In --accuracy-only runs there is no performance phase, so ctx.rt_settings is None. _run_benchmark_async read ctx.rt_settings.max_duration_ms unconditionally, raising AttributeError at session setup. The global timeout only applies to the performance phase, so default max_duration_ms to None when rt_settings is absent.

--accuracy-only forces a single connection for deterministic sample ordering, which serializes the offline MAX_THROUGHPUT burst. For large accuracy datasets the sequential processing time exceeds the hardcoded 240 s drain cap, so the phase aborted mid-run dropping in-flight samples. Make drain_timeout a per-phase field defaulting to 240.0 (performance phases unchanged). Accuracy phases pass None to drain unbounded, since every sample must complete; a dropped transport still unblocks the wait via the _receive_responses close path. Re-check inflight after clear() to close a completion/clear race on the unbounded path.

The msgspec adapter serialized tool_calls into `output` AND kept them in the structured `tool_calls` field. TextModelOutput.__str__ then appended the tool_calls a second time, producing duplicated, malformed JSON (`[{...}][{...}]`) that FunctionCallExtractor could not parse. This made every single-turn function-calling subset score 0% (and gave hallucination subsets a spurious 100%). Keep `output` as the textual content only; the structured tool_calls field is the single source serialized once by __str__. This matches the non-msgspec OpenAI adapter and the streaming accumulator, which already did this. Multi-turn is unaffected: it uses a separate httpx runner that reads structured tool_calls from the raw response and never touches TextModelOutput.

…mbles When a model emits a prose preamble alongside a tool call (e.g. "To compute this, I'll call...\n[{...}]"), str(TextModelOutput) prefixes the tool-call JSON with that text, so the function-call parser's json.loads fast path fails and the sample scores 0. Override BFCLv4Scorer.get_outputs to prefer the structured tool_calls field when present (the function call is the answer; the prose is chatter), falling back to the full string for plain-text responses such as hallucination refusals. Verified deterministic across repeated fresh-server single-turn runs.

Some servers (e.g. trtllm-serve on edge devices) stall when tools are present but tool_choice is omitted, relying on a server default. Set tool_choice="auto" explicitly on each single-turn sample and pass it through the function_calling preset's ColumnFilter so it reaches the request payload. Multi-turn already sends tool_choice="auto" via its dedicated runner, so this only affects the single-turn path.

Add ModelParams.seed field and propagate it to the OpenAI wire format so runs can be made reproducible: - config/schema.py: add seed field to ModelParams - openai/types.py: add seed field to ChatCompletionRequest - openai/openai_adapter.py: include seed in metadata dict - openai/openai_msgspec_adapter.py: include seed in metadata dict and ChatCompletionRequest construction - evaluation/bfcl_v4_multi_turn_runner.py: accept seed param; inject payload["seed"] when set - evaluation/bfcl_v4_multi_turn_cli.py: expose --seed CLI arg and pass it to BFCLMultiTurnRunner - commands/benchmark/cli.py: expose --seed and --report-dir overrides on the from-config subcommand - tests: unit coverage for seed propagation in msgspec adapter, multi-turn runner, and from-config CLI

Expand examples/10_BFCLv4_Example/README.md: - Add a "Reproducing from the PRs" section explaining that PR #1 (numpy pin) is a prerequisite for PR #2 to install [bfcl] - Show how to check out and install from the branches - Document --seed flag for both single-turn (from-config) and multi-turn CLI paths - Replace placeholder accuracy numbers with confirmed Thor validation results (Qwen3.6-27B-Q4_K_M, temperature=0, 456 ST samples): non_live 86.98%, live 84.12%, hallucination 94.32%, overall 87.50% (both seed runs identical); MT base 70.00% (exact evalscope parity) - Add output file paths and a quick sanity-check script

Replace the terse reference doc with a numbered walkthrough that works for someone unfamiliar with the endpoints repo: - What is this / What is the endpoints repo - Step 0: prerequisites including a llama.cpp Docker quick-start - Step 1: install from the two PRs with conflict explanation - Step 2: run single-turn (with YAML config notes) - Step 3: run multi-turn - Step 4: verify results with one-liners - Seed reproducibility section - Reference results table (Thor, two seed runs, evalscope parity)

- Fix memory requirement: ~24 GB (not 16 GB) for the Q4 GGUF + KV cache - Replace generic Docker quick-start with Thor-specific llama.cpp native build instructions (Docker CUDA images don't target sm_110/aarch64-SBSA) - Add x86_64 Docker quick-start in a collapsible details block - Fix Step 4 result path: results.json under accuracy_scores key, not a separate accuracy_scores.json; add report.txt note - Add server-side determinism note (--seed 42, -np 1 on llama-server) - Replace placeholder MT numbers with actual sampled-run Thor results: multi_turn_base 66.67% (4/6), miss_func 33.33% (2/6), miss_param 16.67% (1/6), long_context 66.67% (4/6), overall 45.84% (24 entries) — identical across both seed runs - Separate full multi_turn_base parity result (140/200 = 70.00%) into its own subsection to avoid conflating sampled and full-set numbers - Update wall-clock: ~82 min ST + ~64 min MT ≈ 2.4–2.5 h total

bfcl-eval's Qwen model handler transitively imports qwen_agent which requires soundfile; without it the import fails on Thor and any machine where soundfile is not already installed.

… run_accuracy.sh Renames examples/10_BFCLv4_Example to examples/10_Edge_Agentic_Example to align with the MLPerf edge-agentic submission category name. Adds run_accuracy.sh — a single script that reproduces both single-turn and multi-turn reference accuracy numbers end-to-end with the exact validated parameters (sampling rates, temperature=0, seed=42, max-steps-per-turn=25). Updates README to lead with the one-liner quick-start referencing the script, fixes the install instructions to point to mlcommons/endpoints (not the fork), adds --seed and --max-steps-per-turn to the Step 3 MT snippet, and corrects the internal path reference in online_agentic_coding_perf.yaml.

- bfcl_v4_execution: move tool-call argument JSON parsing inside the try-except block so json.JSONDecodeError from malformed model output (common on small/quantized models) is caught and handled gracefully rather than crashing the evaluation run. - bfcl_v4_multi_turn_runner: only include tools/tool_choice in the request payload when the tools list is non-empty, avoiding 400 errors on endpoints that reject tool_choice without accompanying tools. - bfcl_v4_scorer: guard n_repeats with max(1, ...) so a partial run where fewer samples completed than num_samples() does not produce a zero divisor and incorrect reporting.

- Relax base numpy to >=1.26.4 so bfcl-eval's numpy==1.26.4 pin resolves; regenerate uv.lock. - Regenerate stale *_template_full.yaml config templates after schema change. - Fix mypy: annotate tool_calls/tool_call_ids and narrow Optional messages/tools in the multi-turn runner; mark BFCLv4Scorer.score override. - Isolate the bfcl extra via [tool.uv].conflicts and add patched filelock/virtualenv floors to the dev extra so bfcl-eval's filelock==3.20.0 pin no longer drags shared tooling deps into CVE versions (CVE-2025-68146, CVE-2026-22701, CVE-2026-22702).

The accuracy phase hardcoded drain_timeout=None, which ignored a user-configured DrainConfig.accuracy_timeout_s and failed test_configured_drain_timeouts_propagate_to_phases. accuracy_timeout_s already defaults to None (unbounded), so reading it preserves the unbounded default while honoring an explicit timeout.

from-config has no --model-params.name / --endpoint-config.endpoints overrides, so the script errored immediately. Render a temp YAML with MODEL/ENDPOINT substituted into the committed config (anchored on the "# set to your ..." comments) so the one-liner still works without editing the tracked file.

Palanivelg requested a review from a team June 8, 2026 17:06

gemini-code-assist Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread src/inference_endpoint/evaluation/bfcl_v4_execution.py Outdated

Comment thread src/inference_endpoint/evaluation/bfcl_v4_multi_turn_runner.py

Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated

Palanivelg changed the title ~~Feat/bfcl v4 combined~~ feat(bfcl): add BFCL v4 edge-agentic accuracy integration Jun 8, 2026

Palanivelg added 14 commits June 8, 2026 17:16

docs(bfcl): generalize "Thor" to "edge device" in BFCL v4 example

5bef6d7

The example and docstring referenced a specific device; reword to the generic "edge device" since the <3h budget applies to embedded targets broadly.

fix(bfcl): add soundfile dependency for qwen_agent transitive import

ab8b591

bfcl-eval's Qwen model handler transitively imports qwen_agent which requires soundfile; without it the import fails on Thor and any machine where soundfile is not already installed.

Palanivelg force-pushed the feat/bfcl-v4-combined branch from 9f0186a to 37c74d2 Compare June 8, 2026 17:21

Palanivelg mentioned this pull request Jun 8, 2026

feat(edge-agentic): add BFCL v4 edge-agentic reference implementation… mlcommons/inference#2598

Open

docs(bfcl): annotate YAML with do-not-change submission rules

d628805

Palanivelg mentioned this pull request Jun 8, 2026

feat(edge-agentic): add BFCL v4 edge-agentic workload to edge benchma… mlcommons/inference_policies#331

Draft

3 tasks

Palanivelg added 2 commits June 8, 2026 21:14

Palanivelg mentioned this pull request Jun 8, 2026

chore(deps): relax numpy pin to >=1.26.4 for bfcl-eval compatibility #345

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(bfcl): add BFCL v4 edge-agentic accuracy integration#346

feat(bfcl): add BFCL v4 edge-agentic accuracy integration#346
Palanivelg wants to merge 19 commits into
mlcommons:mainfrom
Palanivelg:feat/bfcl-v4-combined

Palanivelg commented Jun 8, 2026 •

edited by zihaok

Loading

Uh oh!

github-actions Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Palanivelg commented Jun 8, 2026 • edited by zihaok Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Palanivelg commented Jun 8, 2026 •

edited by zihaok

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading