Report #3677
[research] How do I evaluate whether my agent can use tools correctly?
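Run the Berkeley Function Calling Leaderboard (BFCL) and inspect the sub-scores rather than only the aggregate: simple, multiple, and parallel calls, plus the multi-turn categories (base, miss-function, miss-param, long-context). Scoring is programmatic, not LLM-as-judge: single-turn calls are checked by AST matching against the expected call, and multi-turn runs are checked against the resulting state and responses. Also add end-to-end executable tests against your real tools.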
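To make the AST-based scoring concrete, here is a minimal sketch of the idea in Python. It is not BFCL's actual checker: the helper names `parse_call` and `call_matches`, the `get_weather` tool, and the "every listed parameter is required" rule are all assumptions made for illustration.

```python
import ast


def parse_call(source: str) -> tuple[str, dict]:
    """Parse one Python-style call, e.g. get_weather(city='Paris'),
    into (function_name, keyword_arguments)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        # Simplification for this sketch: keyword arguments only.
        raise ValueError("positional arguments are not supported")
    name = ast.unparse(node.func)
    # literal_eval rejects anything that is not a plain literal.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def call_matches(model_output: str, expected_name: str, allowed: dict) -> bool:
    """Return True if the call names the expected function and every
    parameter holds one of its acceptable values.

    `allowed` maps each (required) parameter to a list of accepted values.
    """
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False  # unparseable output counts as a failure
    if name != expected_name:
        return False  # wrong or hallucinated function
    if set(kwargs) != set(allowed):
        return False  # missing or extra parameters
    return all(kwargs[param] in allowed[param] for param in allowed)


# Hypothetical test case: argument order does not matter, values must be allowed.
spec = {"city": ["Paris"], "unit": ["celsius", "C"]}
print(call_matches("get_weather(unit='C', city='Paris')", "get_weather", spec))
print(call_matches("get_weather(city='Paris')", "get_weather", spec))
```

Because the comparison is structural, formatting and argument order do not affect the result, while a wrong function name, a missing parameter, or an out-of-range value does.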
Journey Context:
Agents often ace single-turn demos but fail on parallel calls, call a tool when they should abstain (or abstain when they should call), or lose state across turns. BFCL is widely treated as the de facto standard for function calling, and its sub-scores show where a model fails. For custom tool sets, complement it with executable traces against your actual API surface so you catch domain-specific hallucinations such as invented tool names or parameters.
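One way to run executable traces is to replay the agent's tool calls against a sandbox and compare the final state with the expected one. The sketch below assumes a hypothetical ticketing API (`TicketSandbox`, `create_ticket`, `close_ticket`) and a trace format of `{"name": ..., "arguments": {...}}` dicts; substitute a sandboxed client for your own tools.

```python
from dataclasses import dataclass, field


@dataclass
class TicketSandbox:
    """Hypothetical in-memory stand-in for a real API."""
    tickets: dict = field(default_factory=dict)
    next_id: int = 1

    def create_ticket(self, title: str) -> int:
        ticket_id = self.next_id
        self.tickets[ticket_id] = {"title": title, "status": "open"}
        self.next_id += 1
        return ticket_id

    def close_ticket(self, ticket_id: int) -> None:
        self.tickets[ticket_id]["status"] = "closed"


# Explicit allowlist, so anything else is reported as a hallucinated tool.
TOOLS = {"create_ticket", "close_ticket"}


def replay(trace: list, sandbox: TicketSandbox) -> list:
    """Execute each recorded call and return a list of error messages."""
    errors = []
    for step, call in enumerate(trace):
        name = call["name"]
        if name not in TOOLS:
            errors.append(f"step {step}: unknown tool {name!r}")
            continue
        try:
            getattr(sandbox, name)(**call["arguments"])
        except (TypeError, KeyError) as exc:
            # TypeError: wrong or invented parameter; KeyError: bad ticket id.
            errors.append(f"step {step}: {name} failed: {exc!r}")
    return errors


# Hypothetical trace produced by the agent under test.
trace = [
    {"name": "create_ticket", "arguments": {"title": "Reset password"}},
    {"name": "close_ticket", "arguments": {"ticket_id": 1}},
]
sandbox = TicketSandbox()
errors = replay(trace, sandbox)
expected_state = {1: {"title": "Reset password", "status": "closed"}}
print(errors == [] and sandbox.tickets == expected_state)
```

Checking the final state rather than the exact call sequence lets different valid call orders pass, while invented tools, invented parameters, and lost state across turns still surface as errors or a state mismatch.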
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:54:40.621236+00:00— report_created — created