Skip to content

feat(bfcl): add BFCL v4 edge-agentic accuracy integration#346

Open
Palanivelg wants to merge 19 commits into
mlcommons:mainfrom
Palanivelg:feat/bfcl-v4-combined
Open

feat(bfcl): add BFCL v4 edge-agentic accuracy integration#346
Palanivelg wants to merge 19 commits into
mlcommons:mainfrom
Palanivelg:feat/bfcl-v4-combined

Conversation

@Palanivelg

@Palanivelg Palanivelg commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Integrates the edge-agentic example (BFCL-v4 as the accuracy set): single-turn + multi-turn pipelines, datasets, adapters, and a reproducible run script. See examples/10_Edge_Agentic_Example/README.md.

Type of change

  • Bug fix
  • [x ] New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Self-contained: this PR includes the numpy relaxation (numpy>=1.26.4) required by bfcl-eval (which hard-pins numpy==1.26.4), so it no longer depends on a separate prerequisite PR. The [bfcl] extra is isolated via [tool.uv].conflicts so its old pins don't constrain the shared tooling deps. This supersedes #345 (closed as redundant).

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@Palanivelg Palanivelg requested a review from a team June 8, 2026 17:06
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for Berkeley Function Calling Leaderboard (BFCL) v4 accuracy benchmarking, introducing single-turn and multi-turn evaluation pipelines, datasets, and adapters, along with sequential sample ordering for deterministic evaluation. The code review feedback identifies three key improvements: moving the tool_calls parsing loop inside the try-except block in bfcl_v4_execution.py to prevent crashes on invalid JSON, conditionally including tools and tool_choice in the request payload in bfcl_v4_multi_turn_runner.py to avoid API errors when no tools are present, and guarding the n_repeats calculation in bfcl_v4_scorer.py to ensure it does not evaluate to zero if some samples fail.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/evaluation/bfcl_v4_execution.py Outdated
Comment thread src/inference_endpoint/evaluation/bfcl_v4_multi_turn_runner.py
Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated
@Palanivelg Palanivelg changed the title Feat/bfcl v4 combined feat(bfcl): add BFCL v4 edge-agentic accuracy integration Jun 8, 2026
Palanivelg added 14 commits June 8, 2026 17:16
Integrate Berkeley Function Calling Leaderboard (BFCL) v4 evaluation
into the accuracy pipeline. Covers both single-turn function-calling
subsets (non_live, live, hallucination) and the agentic multi-turn
subsets (multi_turn_base + variants), validated against evalscope.

Single-turn (drop-in scorer):
  - BFCLv4 predefined dataset (categories=[non_live|live|hallucination],
    configurable sample_pct) plus a default preset.
  - BFCLv4Scorer wires bfcl-eval's ast_checker into the standard
    accuracy phase via the existing scorer registry.
  - FunctionCallExtractor: normalizes native tool_calls JSON, JSON
    arrays of function-call objects, and text-form function calls
    into the canonical BFCL input format.
  - openai_msgspec_adapter now:
      * passes tool_choice through,
      * emits tool_calls verbatim as output_text for scoring,
      * coerces whole-number temperatures (vLLM strictness),
      * makes max_completion_tokens optional and uses a permissive
        ColumnFilter so messages+tools datasets pass through.

Multi-turn (agentic loop, outside the standard scorer):
  - BFCLExecutionBridge: parses tool_calls, executes them locally
    against bfcl-eval's class instances, and constructs the tool
    response messages for the next turn.
  - BFCLMultiTurnRunner: drives the per-entry agentic conversation
    via httpx (bounded by max_steps_per_turn / timeout_s).
  - BFCLv4MultiTurnScorer: invokes bfcl-eval's multi_turn_checker.
  - bfcl_v4_multi_turn_cli: standalone CLI for the multi-turn flow.

Supporting plumbing:
  - SequentialSampleOrder + `sequential=` flag on create_sample_order,
    used by accuracy phases so ordering matches reference runs.
  - BenchmarkConfig.Dataset.params: dict for dataset-specific kwargs
    (e.g. categories, sample_pct) plumbed through DataLoaderFactory.
  - ScorerMethod.BFCL_V4.
  - Dataset.load() preserves user-provided ColumnFilters when the
    adapter would otherwise inject a conflicting one.
  - `--accuracy-only` benchmark flag: skip the performance phase
    entirely (forces num_workers=1, max_connections=1 for
    deterministic per-sample ordering).

Optional dep: `pip install -e ".[bfcl]"` (`bfcl-eval==2026.3.23`).
The top-level numpy pin must be relaxed to `>=1.26.4` because
bfcl-eval hard-pins `numpy==1.26.4` — shipped as a separate
prerequisite commit on the chore/relax-numpy-pin branch.

Validation (Qwen3.6-27B-Q4_K_M, temperature=0):
  - Single-turn live (10%): ~82% accuracy.
  - Multi-turn base (full 200): 140/200 = 70.00%, exact parity
    (100% match, 0 mismatches) with evalscope on identical inputs.
Add a committed config that reproduces the validated <3h BFCL v4 accuracy
run on an embedded device (Thor).

Dataset (BFCLv4.generate):
  - category_sample_pct: per-category sampling rates (e.g. non_live 20 /
    live 10 / hallucination 5), resolved per subset via SUBSET_TO_CATEGORY.
  - subset_floor: subsets whose TOTAL size is <= floor are taken in full,
    preventing tiny subsets (live_parallel=16, live_parallel_multiple=24)
    from collapsing to one or two noisy samples. Selection stays
    deterministic (head(n)). Plain sample_pct behavior is unchanged.

Multi-turn CLI:
  - --sample-pct: deterministic per-subset sub-sampling so a 3% (~24 entry)
    multi-turn run is reproducible. Defaults to all entries.

examples/10_BFCLv4_Example/:
  - offline_bfcl_v4_single_turn.yaml: single-turn (non_live + live +
    hallucination) accuracy config, run with --accuracy-only.
  - README.md: documents the two run paths (single-turn YAML vs multi-turn
    CLI), the per-category sampling + floor, and the ~2h49m Thor budget.
The example and docstring referenced a specific device; reword to the
generic "edge device" since the <3h budget applies to embedded targets
broadly.
In --accuracy-only runs there is no performance phase, so ctx.rt_settings
is None. _run_benchmark_async read ctx.rt_settings.max_duration_ms
unconditionally, raising AttributeError at session setup. The global
timeout only applies to the performance phase, so default max_duration_ms
to None when rt_settings is absent.
--accuracy-only forces a single connection for deterministic sample
ordering, which serializes the offline MAX_THROUGHPUT burst. For large
accuracy datasets the sequential processing time exceeds the hardcoded
240 s drain cap, so the phase aborted mid-run dropping in-flight samples.

Make drain_timeout a per-phase field defaulting to 240.0 (performance
phases unchanged). Accuracy phases pass None to drain unbounded, since
every sample must complete; a dropped transport still unblocks the wait
via the _receive_responses close path. Re-check inflight after clear()
to close a completion/clear race on the unbounded path.
The msgspec adapter serialized tool_calls into `output` AND kept them in the
structured `tool_calls` field. TextModelOutput.__str__ then appended the
tool_calls a second time, producing duplicated, malformed JSON
(`[{...}][{...}]`) that FunctionCallExtractor could not parse. This made every
single-turn function-calling subset score 0% (and gave hallucination subsets a
spurious 100%).

Keep `output` as the textual content only; the structured tool_calls field is
the single source serialized once by __str__. This matches the non-msgspec
OpenAI adapter and the streaming accumulator, which already did this.

Multi-turn is unaffected: it uses a separate httpx runner that reads structured
tool_calls from the raw response and never touches TextModelOutput.
…mbles

When a model emits a prose preamble alongside a tool call (e.g.
"To compute this, I'll call...\n[{...}]"), str(TextModelOutput) prefixes the
tool-call JSON with that text, so the function-call parser's json.loads fast
path fails and the sample scores 0.

Override BFCLv4Scorer.get_outputs to prefer the structured tool_calls field
when present (the function call is the answer; the prose is chatter), falling
back to the full string for plain-text responses such as hallucination
refusals. Verified deterministic across repeated fresh-server single-turn runs.
Some servers (e.g. trtllm-serve on edge devices) stall when tools are present
but tool_choice is omitted, relying on a server default. Set tool_choice="auto"
explicitly on each single-turn sample and pass it through the function_calling
preset's ColumnFilter so it reaches the request payload.

Multi-turn already sends tool_choice="auto" via its dedicated runner, so this
only affects the single-turn path.
Add ModelParams.seed field and propagate it to the OpenAI wire format
so runs can be made reproducible:

- config/schema.py: add seed field to ModelParams
- openai/types.py: add seed field to ChatCompletionRequest
- openai/openai_adapter.py: include seed in metadata dict
- openai/openai_msgspec_adapter.py: include seed in metadata dict and
  ChatCompletionRequest construction
- evaluation/bfcl_v4_multi_turn_runner.py: accept seed param; inject
  payload["seed"] when set
- evaluation/bfcl_v4_multi_turn_cli.py: expose --seed CLI arg and pass
  it to BFCLMultiTurnRunner
- commands/benchmark/cli.py: expose --seed and --report-dir overrides
  on the from-config subcommand
- tests: unit coverage for seed propagation in msgspec adapter,
  multi-turn runner, and from-config CLI
Expand examples/10_BFCLv4_Example/README.md:
- Add a "Reproducing from the PRs" section explaining that PR #1
  (numpy pin) is a prerequisite for PR #2 to install [bfcl]
- Show how to check out and install from the branches
- Document --seed flag for both single-turn (from-config) and
  multi-turn CLI paths
- Replace placeholder accuracy numbers with confirmed Thor validation
  results (Qwen3.6-27B-Q4_K_M, temperature=0, 456 ST samples):
    non_live 86.98%, live 84.12%, hallucination 94.32%, overall 87.50%
    (both seed runs identical); MT base 70.00% (exact evalscope parity)
- Add output file paths and a quick sanity-check script
Replace the terse reference doc with a numbered walkthrough that
works for someone unfamiliar with the endpoints repo:
- What is this / What is the endpoints repo
- Step 0: prerequisites including a llama.cpp Docker quick-start
- Step 1: install from the two PRs with conflict explanation
- Step 2: run single-turn (with YAML config notes)
- Step 3: run multi-turn
- Step 4: verify results with one-liners
- Seed reproducibility section
- Reference results table (Thor, two seed runs, evalscope parity)
- Fix memory requirement: ~24 GB (not 16 GB) for the Q4 GGUF + KV cache
- Replace generic Docker quick-start with Thor-specific llama.cpp native
  build instructions (Docker CUDA images don't target sm_110/aarch64-SBSA)
- Add x86_64 Docker quick-start in a collapsible details block
- Fix Step 4 result path: results.json under accuracy_scores key, not
  a separate accuracy_scores.json; add report.txt note
- Add server-side determinism note (--seed 42, -np 1 on llama-server)
- Replace placeholder MT numbers with actual sampled-run Thor results:
    multi_turn_base 66.67% (4/6), miss_func 33.33% (2/6),
    miss_param 16.67% (1/6), long_context 66.67% (4/6),
    overall 45.84% (24 entries) — identical across both seed runs
- Separate full multi_turn_base parity result (140/200 = 70.00%) into
  its own subsection to avoid conflating sampled and full-set numbers
- Update wall-clock: ~82 min ST + ~64 min MT ≈ 2.4–2.5 h total
bfcl-eval's Qwen model handler transitively imports qwen_agent which
requires soundfile; without it the import fails on Thor and any machine
where soundfile is not already installed.
… run_accuracy.sh

Renames examples/10_BFCLv4_Example to examples/10_Edge_Agentic_Example to
align with the MLPerf edge-agentic submission category name.

Adds run_accuracy.sh — a single script that reproduces both single-turn and
multi-turn reference accuracy numbers end-to-end with the exact validated
parameters (sampling rates, temperature=0, seed=42, max-steps-per-turn=25).

Updates README to lead with the one-liner quick-start referencing the script,
fixes the install instructions to point to mlcommons/endpoints (not the fork),
adds --seed and --max-steps-per-turn to the Step 3 MT snippet, and corrects
the internal path reference in online_agentic_coding_perf.yaml.
@Palanivelg Palanivelg force-pushed the feat/bfcl-v4-combined branch from 9f0186a to 37c74d2 Compare June 8, 2026 17:21
- bfcl_v4_execution: move tool-call argument JSON parsing inside the
  try-except block so json.JSONDecodeError from malformed model output
  (common on small/quantized models) is caught and handled gracefully
  rather than crashing the evaluation run.

- bfcl_v4_multi_turn_runner: only include tools/tool_choice in the
  request payload when the tools list is non-empty, avoiding 400 errors
  on endpoints that reject tool_choice without accompanying tools.

- bfcl_v4_scorer: guard n_repeats with max(1, ...) so a partial run
  where fewer samples completed than num_samples() does not produce a
  zero divisor and incorrect reporting.
- Relax base numpy to >=1.26.4 so bfcl-eval's numpy==1.26.4 pin resolves;
  regenerate uv.lock.
- Regenerate stale *_template_full.yaml config templates after schema change.
- Fix mypy: annotate tool_calls/tool_call_ids and narrow Optional
  messages/tools in the multi-turn runner; mark BFCLv4Scorer.score override.
- Isolate the bfcl extra via [tool.uv].conflicts and add patched
  filelock/virtualenv floors to the dev extra so bfcl-eval's filelock==3.20.0
  pin no longer drags shared tooling deps into CVE versions
  (CVE-2025-68146, CVE-2026-22701, CVE-2026-22702).
The accuracy phase hardcoded drain_timeout=None, which ignored a
user-configured DrainConfig.accuracy_timeout_s and failed
test_configured_drain_timeouts_propagate_to_phases. accuracy_timeout_s
already defaults to None (unbounded), so reading it preserves the
unbounded default while honoring an explicit timeout.
from-config has no --model-params.name / --endpoint-config.endpoints
overrides, so the script errored immediately. Render a temp YAML with
MODEL/ENDPOINT substituted into the committed config (anchored on the
"# set to your ..." comments) so the one-liner still works without
editing the tracked file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant