Report #2294
[research] Which models are most reliable for multi-turn tool use and function calling?
Top closed models (Claude 4/Opus, GPT-4o/5) lead the Berkeley Function-Calling Leaderboard (BFCL); among open-weight models, Qwen3 and ToolACE-2 rank highest. Use native function-calling mode when available; prompt-based tool emulation is less reliable, especially for parallel and multi-turn calls.
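Whichever mode is used, it can help to validate every model-emitted call before executing it, since malformed or hallucinated calls are more likely with prompt-based emulation. The following is a minimal sketch, not any provider's API: the `TOOLS` registry, the `get_weather` tool, and the `{"name": ..., "arguments": ...}` call shape are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical tool registry: tool name -> required/optional argument types.
# A real deployment would typically derive this from its own tool schemas.
TOOLS = {
    "get_weather": {"required": {"city": str}, "optional": {"unit": str}},
}


def validate_call(call, tools=TOOLS):
    """Return a list of problems with one tool call (empty list if it is valid)."""
    spec = tools.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]

    args = call.get("arguments", {})
    if isinstance(args, str):
        # Assumption: arguments may arrive as a JSON-encoded string.
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    allowed = {**spec["required"], **spec["optional"]}
    problems = []
    for name in spec["required"]:
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in allowed:
            problems.append(f"unexpected argument: {name}")
        elif not isinstance(value, allowed[name]):
            problems.append(f"wrong type for argument: {name}")
    return problems


def validate_parallel(calls, tools=TOOLS):
    """Validate a batch of parallel calls; map index -> problems for invalid ones."""
    return {i: p for i, c in enumerate(calls) if (p := validate_call(c, tools))}
```

A caller could reject or re-prompt whenever `validate_parallel` returns a non-empty mapping, rather than executing a partially valid batch.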
Journey Context:
BFCL v3/v4 is the de facto standard for tool-use evaluation, covering simple, parallel, multi-turn, and multi-step calls. Many models that are strong at prose still fail at relevance detection (knowing when NOT to call a tool) and at multi-turn state tracking. Native function-calling (FC) variants consistently outperform prompt-mode variants of the same model.
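To make the two failure modes concrete, here is a rough sketch of how a small in-house check might score a multi-turn transcript. It is not BFCL's actual scoring method; the function name `score_turns` and the per-turn representation (either `None` for "no call expected" or a `{"name": ..., "arguments": ...}` dict) are assumptions made for this example.

```python
def score_turns(expected, actual):
    """Compare expected and actual tool calls turn by turn.

    Each list entry is None (no tool call should be made on that turn)
    or a dict of the form {"name": ..., "arguments": {...}}.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must cover the same turns")

    no_call_total = no_call_ok = 0
    call_total = call_ok = 0
    for exp, act in zip(expected, actual):
        if exp is None:
            # Relevance detection: the model should abstain on this turn.
            no_call_total += 1
            no_call_ok += act is None
        else:
            # Exact match on tool name and arguments (a strict, simplified criterion).
            call_total += 1
            call_ok += act == exp

    return {
        "relevance_accuracy": no_call_ok / no_call_total if no_call_total else None,
        "call_accuracy": call_ok / call_total if call_total else None,
        # Multi-turn tasks are usually only useful if every turn is right.
        "all_turns_correct": no_call_ok == no_call_total and call_ok == call_total,
    }
```

Reporting the abstention rate separately from call accuracy keeps a model that calls tools on every turn from looking strong on an aggregate number.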
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:52:14.397516+00:00— report_created — created