feat(bfcl): add BFCL v4 edge-agentic accuracy integration#346
feat(bfcl): add BFCL v4 edge-agentic accuracy integration#346Palanivelg wants to merge 19 commits into
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request adds support for Berkeley Function Calling Leaderboard (BFCL) v4 accuracy benchmarking, introducing single-turn and multi-turn evaluation pipelines, datasets, and adapters, along with sequential sample ordering for deterministic evaluation. The code review feedback identifies three key improvements: moving the tool_calls parsing loop inside the try-except block in bfcl_v4_execution.py to prevent crashes on invalid JSON, conditionally including tools and tool_choice in the request payload in bfcl_v4_multi_turn_runner.py to avoid API errors when no tools are present, and guarding the n_repeats calculation in bfcl_v4_scorer.py to ensure it does not evaluate to zero if some samples fail.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Integrate Berkeley Function Calling Leaderboard (BFCL) v4 evaluation
into the accuracy pipeline. Covers both single-turn function-calling
subsets (non_live, live, hallucination) and the agentic multi-turn
subsets (multi_turn_base + variants), validated against evalscope.
Single-turn (drop-in scorer):
- BFCLv4 predefined dataset (categories=[non_live|live|hallucination],
configurable sample_pct) plus a default preset.
- BFCLv4Scorer wires bfcl-eval's ast_checker into the standard
accuracy phase via the existing scorer registry.
- FunctionCallExtractor: normalizes native tool_calls JSON, JSON
arrays of function-call objects, and text-form function calls
into the canonical BFCL input format.
- openai_msgspec_adapter now:
* passes tool_choice through,
* emits tool_calls verbatim as output_text for scoring,
* coerces whole-number temperatures (vLLM strictness),
* makes max_completion_tokens optional and uses a permissive
ColumnFilter so messages+tools datasets pass through.
Multi-turn (agentic loop, outside the standard scorer):
- BFCLExecutionBridge: parses tool_calls, executes them locally
against bfcl-eval's class instances, and constructs the tool
response messages for the next turn.
- BFCLMultiTurnRunner: drives the per-entry agentic conversation
via httpx (bounded by max_steps_per_turn / timeout_s).
- BFCLv4MultiTurnScorer: invokes bfcl-eval's multi_turn_checker.
- bfcl_v4_multi_turn_cli: standalone CLI for the multi-turn flow.
Supporting plumbing:
- SequentialSampleOrder + `sequential=` flag on create_sample_order,
used by accuracy phases so ordering matches reference runs.
- BenchmarkConfig.Dataset.params: dict for dataset-specific kwargs
(e.g. categories, sample_pct) plumbed through DataLoaderFactory.
- ScorerMethod.BFCL_V4.
- Dataset.load() preserves user-provided ColumnFilters when the
adapter would otherwise inject a conflicting one.
- `--accuracy-only` benchmark flag: skip the performance phase
entirely (forces num_workers=1, max_connections=1 for
deterministic per-sample ordering).
Optional dep: `pip install -e ".[bfcl]"` (`bfcl-eval==2026.3.23`).
The top-level numpy pin must be relaxed to `>=1.26.4` because
bfcl-eval hard-pins `numpy==1.26.4` — shipped as a separate
prerequisite commit on the chore/relax-numpy-pin branch.
Validation (Qwen3.6-27B-Q4_K_M, temperature=0):
- Single-turn live (10%): ~82% accuracy.
- Multi-turn base (full 200): 140/200 = 70.00%, exact parity
(100% match, 0 mismatches) with evalscope on identical inputs.
Add a committed config that reproduces the validated <3h BFCL v4 accuracy
run on an embedded device (Thor).
Dataset (BFCLv4.generate):
- category_sample_pct: per-category sampling rates (e.g. non_live 20 /
live 10 / hallucination 5), resolved per subset via SUBSET_TO_CATEGORY.
- subset_floor: subsets whose TOTAL size is <= floor are taken in full,
preventing tiny subsets (live_parallel=16, live_parallel_multiple=24)
from collapsing to one or two noisy samples. Selection stays
deterministic (head(n)). Plain sample_pct behavior is unchanged.
Multi-turn CLI:
- --sample-pct: deterministic per-subset sub-sampling so a 3% (~24 entry)
multi-turn run is reproducible. Defaults to all entries.
examples/10_BFCLv4_Example/:
- offline_bfcl_v4_single_turn.yaml: single-turn (non_live + live +
hallucination) accuracy config, run with --accuracy-only.
- README.md: documents the two run paths (single-turn YAML vs multi-turn
CLI), the per-category sampling + floor, and the ~2h49m Thor budget.
The example and docstring referenced a specific device; reword to the generic "edge device" since the <3h budget applies to embedded targets broadly.
In --accuracy-only runs there is no performance phase, so ctx.rt_settings is None. _run_benchmark_async read ctx.rt_settings.max_duration_ms unconditionally, raising AttributeError at session setup. The global timeout only applies to the performance phase, so default max_duration_ms to None when rt_settings is absent.
--accuracy-only forces a single connection for deterministic sample ordering, which serializes the offline MAX_THROUGHPUT burst. For large accuracy datasets the sequential processing time exceeds the hardcoded 240 s drain cap, so the phase aborted mid-run dropping in-flight samples. Make drain_timeout a per-phase field defaulting to 240.0 (performance phases unchanged). Accuracy phases pass None to drain unbounded, since every sample must complete; a dropped transport still unblocks the wait via the _receive_responses close path. Re-check inflight after clear() to close a completion/clear race on the unbounded path.
The msgspec adapter serialized tool_calls into `output` AND kept them in the
structured `tool_calls` field. TextModelOutput.__str__ then appended the
tool_calls a second time, producing duplicated, malformed JSON
(`[{...}][{...}]`) that FunctionCallExtractor could not parse. This made every
single-turn function-calling subset score 0% (and gave hallucination subsets a
spurious 100%).
Keep `output` as the textual content only; the structured tool_calls field is
the single source serialized once by __str__. This matches the non-msgspec
OpenAI adapter and the streaming accumulator, which already did this.
Multi-turn is unaffected: it uses a separate httpx runner that reads structured
tool_calls from the raw response and never touches TextModelOutput.
…mbles
When a model emits a prose preamble alongside a tool call (e.g.
"To compute this, I'll call...\n[{...}]"), str(TextModelOutput) prefixes the
tool-call JSON with that text, so the function-call parser's json.loads fast
path fails and the sample scores 0.
Override BFCLv4Scorer.get_outputs to prefer the structured tool_calls field
when present (the function call is the answer; the prose is chatter), falling
back to the full string for plain-text responses such as hallucination
refusals. Verified deterministic across repeated fresh-server single-turn runs.
Some servers (e.g. trtllm-serve on edge devices) stall when tools are present but tool_choice is omitted, relying on a server default. Set tool_choice="auto" explicitly on each single-turn sample and pass it through the function_calling preset's ColumnFilter so it reaches the request payload. Multi-turn already sends tool_choice="auto" via its dedicated runner, so this only affects the single-turn path.
Add ModelParams.seed field and propagate it to the OpenAI wire format so runs can be made reproducible: - config/schema.py: add seed field to ModelParams - openai/types.py: add seed field to ChatCompletionRequest - openai/openai_adapter.py: include seed in metadata dict - openai/openai_msgspec_adapter.py: include seed in metadata dict and ChatCompletionRequest construction - evaluation/bfcl_v4_multi_turn_runner.py: accept seed param; inject payload["seed"] when set - evaluation/bfcl_v4_multi_turn_cli.py: expose --seed CLI arg and pass it to BFCLMultiTurnRunner - commands/benchmark/cli.py: expose --seed and --report-dir overrides on the from-config subcommand - tests: unit coverage for seed propagation in msgspec adapter, multi-turn runner, and from-config CLI
Expand examples/10_BFCLv4_Example/README.md: - Add a "Reproducing from the PRs" section explaining that PR #1 (numpy pin) is a prerequisite for PR #2 to install [bfcl] - Show how to check out and install from the branches - Document --seed flag for both single-turn (from-config) and multi-turn CLI paths - Replace placeholder accuracy numbers with confirmed Thor validation results (Qwen3.6-27B-Q4_K_M, temperature=0, 456 ST samples): non_live 86.98%, live 84.12%, hallucination 94.32%, overall 87.50% (both seed runs identical); MT base 70.00% (exact evalscope parity) - Add output file paths and a quick sanity-check script
Replace the terse reference doc with a numbered walkthrough that works for someone unfamiliar with the endpoints repo: - What is this / What is the endpoints repo - Step 0: prerequisites including a llama.cpp Docker quick-start - Step 1: install from the two PRs with conflict explanation - Step 2: run single-turn (with YAML config notes) - Step 3: run multi-turn - Step 4: verify results with one-liners - Seed reproducibility section - Reference results table (Thor, two seed runs, evalscope parity)
- Fix memory requirement: ~24 GB (not 16 GB) for the Q4 GGUF + KV cache
- Replace generic Docker quick-start with Thor-specific llama.cpp native
build instructions (Docker CUDA images don't target sm_110/aarch64-SBSA)
- Add x86_64 Docker quick-start in a collapsible details block
- Fix Step 4 result path: results.json under accuracy_scores key, not
a separate accuracy_scores.json; add report.txt note
- Add server-side determinism note (--seed 42, -np 1 on llama-server)
- Replace placeholder MT numbers with actual sampled-run Thor results:
multi_turn_base 66.67% (4/6), miss_func 33.33% (2/6),
miss_param 16.67% (1/6), long_context 66.67% (4/6),
overall 45.84% (24 entries) — identical across both seed runs
- Separate full multi_turn_base parity result (140/200 = 70.00%) into
its own subsection to avoid conflating sampled and full-set numbers
- Update wall-clock: ~82 min ST + ~64 min MT ≈ 2.4–2.5 h total
bfcl-eval's Qwen model handler transitively imports qwen_agent which requires soundfile; without it the import fails on Thor and any machine where soundfile is not already installed.
… run_accuracy.sh Renames examples/10_BFCLv4_Example to examples/10_Edge_Agentic_Example to align with the MLPerf edge-agentic submission category name. Adds run_accuracy.sh — a single script that reproduces both single-turn and multi-turn reference accuracy numbers end-to-end with the exact validated parameters (sampling rates, temperature=0, seed=42, max-steps-per-turn=25). Updates README to lead with the one-liner quick-start referencing the script, fixes the install instructions to point to mlcommons/endpoints (not the fork), adds --seed and --max-steps-per-turn to the Step 3 MT snippet, and corrects the internal path reference in online_agentic_coding_perf.yaml.
9f0186a to
37c74d2
Compare
- bfcl_v4_execution: move tool-call argument JSON parsing inside the try-except block so json.JSONDecodeError from malformed model output (common on small/quantized models) is caught and handled gracefully rather than crashing the evaluation run. - bfcl_v4_multi_turn_runner: only include tools/tool_choice in the request payload when the tools list is non-empty, avoiding 400 errors on endpoints that reject tool_choice without accompanying tools. - bfcl_v4_scorer: guard n_repeats with max(1, ...) so a partial run where fewer samples completed than num_samples() does not produce a zero divisor and incorrect reporting.
- Relax base numpy to >=1.26.4 so bfcl-eval's numpy==1.26.4 pin resolves; regenerate uv.lock. - Regenerate stale *_template_full.yaml config templates after schema change. - Fix mypy: annotate tool_calls/tool_call_ids and narrow Optional messages/tools in the multi-turn runner; mark BFCLv4Scorer.score override. - Isolate the bfcl extra via [tool.uv].conflicts and add patched filelock/virtualenv floors to the dev extra so bfcl-eval's filelock==3.20.0 pin no longer drags shared tooling deps into CVE versions (CVE-2025-68146, CVE-2026-22701, CVE-2026-22702).
The accuracy phase hardcoded drain_timeout=None, which ignored a user-configured DrainConfig.accuracy_timeout_s and failed test_configured_drain_timeouts_propagate_to_phases. accuracy_timeout_s already defaults to None (unbounded), so reading it preserves the unbounded default while honoring an explicit timeout.
from-config has no --model-params.name / --endpoint-config.endpoints overrides, so the script errored immediately. Render a temp YAML with MODEL/ENDPOINT substituted into the committed config (anchored on the "# set to your ..." comments) so the one-liner still works without editing the tracked file.
What does this PR do?
Integrates the edge-agentic example (BFCL-v4 as the accuracy set): single-turn + multi-turn pipelines, datasets, adapters, and a reproducible run script. See
examples/10_Edge_Agentic_Example/README.md.Type of change
Related issues
Self-contained: this PR includes the numpy relaxation (
numpy>=1.26.4) required bybfcl-eval(which hard-pinsnumpy==1.26.4), so it no longer depends on a separate prerequisite PR. The[bfcl]extra is isolated via[tool.uv].conflictsso its old pins don't constrain the shared tooling deps. This supersedes #345 (closed as redundant).Testing
Checklist