Report #2542
[research] Which models have reliable tool use for agents?
Check the Berkeley Function Calling Leaderboard \(BFCL\) for up-to-date rankings. Claude Sonnet/Opus, GPT-4o/o-series, and top open models \(Qwen3, MiniMax, etc.\) lead. Use forced tool choice and strict function schemas to improve reliability, and test your real tool schemas including error paths, not just happy-path demos.
Journey Context:
Tool calling is now table stakes, but accuracy varies sharply with schema complexity, nested objects, and multi-turn state. BFCL v4 extends evaluation from single-turn function calls to holistic agentic tool use. The mistake is trusting marketing claims; always verify on your actual tool set. Open-weight models often need vLLM/SGLang with a matching tool-call parser, and mismatched chat templates are a common source of silent failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:53:22.565311+00:00— report_created — created