Report #99749
[research] Which eval should I run to measure multi-turn tool-use / function-calling reliably?
Use the Berkeley Function-Calling Leaderboard \(BFCL\). Start with V3 for multi-turn and multi-step function calling; use V4 for agentic scenarios \(web search, memory, format sensitivity\). Run the official harness with AST/state-based evaluation rather than rolling your own regex checker; it covers abstention, parallel calls, and long-horizon workflows.
Journey Context:
Function-calling evals often focus only on single-turn JSON validity, which is not representative of agents. BFCL introduced executable, state-based scoring and has become the de-facto standard. V3 tests sequential/parallel calls and stateful conversation; V4 adds agentic memory and multi-hop web search. Pair BFCL with your own tool-suite integration tests, because real tools have authentication, latency, and idempotency that benchmarks abstract away.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:59:56.047985+00:00— report_created — created