Report #83285
[cost_intel] Claude 3.5 Sonnet achieves 94% tool use accuracy on BFCL vs GPT-4o's 89%, reducing error correction loops by 50% in multi-step workflows
Use Claude 3.5 Sonnet for agent workflows requiring >3 sequential tool calls or complex argument schemas; use GPT-4o for single-tool calls or when cost is constrained to <$0.01 per request
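The recommendation above can be read as a small routing rule. The sketch below is one possible encoding, not part of the original report: `Workload`, `pick_model`, and the returned labels are hypothetical names (the labels are not API model identifiers), whether a schema counts as "complex" is left to the caller, and checking the cost cap first is an assumption, since the report does not say which condition wins when both apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    sequential_tool_calls: int
    complex_schemas: bool  # caller's judgment; the report gives no definition
    max_cost_per_request_usd: Optional[float] = None  # None means no budget cap


def pick_model(w: Workload) -> str:
    # Assumption: a hard budget under $0.01 per request overrides everything else.
    if w.max_cost_per_request_usd is not None and w.max_cost_per_request_usd < 0.01:
        return "gpt-4o"
    # More than 3 sequential tool calls, or complex argument schemas.
    if w.sequential_tool_calls > 3 or w.complex_schemas:
        return "claude-3.5-sonnet"
    # Single-tool calls.
    if w.sequential_tool_calls <= 1:
        return "gpt-4o"
    # 2-3 simple calls with no budget cap: the report gives no guidance.
    return "unspecified"
```

The `"unspecified"` branch is deliberate: workloads with two or three simple tool calls and no cost cap fall outside both halves of the recommendation, so the sketch does not guess.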
Journey Context:
While GPT-4o and Claude 3.5 Sonnet have similar perplexity scores, Claude 3.5 Sonnet demonstrates superior adherence to tool schemas in the Berkeley Function Calling Leaderboard (BFCL), particularly in multi-turn conversations where context drifts. In production agent workflows, GPT-4o requires error correction (retry loops) on ~20% of complex tool calls vs ~10% for Sonnet. At 1k requests/day with 3 tool calls each, the lower retry rate can make Sonnet cheaper overall despite its roughly 2.4x per-token price ($3 vs $1.25 per million tokens), once the full cost of a failed call is counted: failed calls waste tokens and increase latency.
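The cost comparison above can be checked with a rough model. The sketch below uses the report's figures (1k requests/day, 3 tool calls each, ~20% vs ~10% failure rates, $1.25 vs $3 per million tokens) and adds several assumptions that are not in the report: every attempt uses the same number of tokens (`TOKENS_PER_ATTEMPT` is a hypothetical 2,000), failures are independent and retried until success (so expected attempts per call are 1 / (1 - p)), and each failed attempt may carry a fixed non-token overhead in dollars (added latency, downstream rework).

```python
def expected_daily_cost(requests_per_day, calls_per_request, failure_rate,
                        price_per_mtok, tokens_per_attempt,
                        overhead_per_failure=0.0):
    """Expected daily cost in USD under a retry-until-success model."""
    calls = requests_per_day * calls_per_request
    attempts_per_call = 1.0 / (1.0 - failure_rate)  # geometric retries
    failures_per_call = attempts_per_call - 1.0
    token_cost = (calls * attempts_per_call * tokens_per_attempt
                  * price_per_mtok / 1_000_000)
    overhead = calls * failures_per_call * overhead_per_failure
    return token_cost + overhead


TOKENS_PER_ATTEMPT = 2_000  # hypothetical; not given in the report

# Token spend only (no per-failure overhead).
gpt4o = expected_daily_cost(1_000, 3, 0.20, 1.25, TOKENS_PER_ATTEMPT)
sonnet = expected_daily_cost(1_000, 3, 0.10, 3.00, TOKENS_PER_ATTEMPT)

# Per-failure overhead at which the two daily costs are equal:
# sonnet + n * f_s * F == gpt4o + n * f_g * F, solved for F.
calls = 1_000 * 3
f_gpt4o = 1.0 / (1.0 - 0.20) - 1.0
f_sonnet = 1.0 / (1.0 - 0.10) - 1.0
break_even = (sonnet - gpt4o) / (calls * (f_gpt4o - f_sonnet))

print(f"GPT-4o token cost/day: ${gpt4o:.3f}")
print(f"Sonnet token cost/day: ${sonnet:.3f}")
print(f"Break-even overhead per failed attempt: ${break_even:.4f}")
```

Under these assumptions, token spend alone comes to about $9.38/day for GPT-4o and $20.00/day for Sonnet, so the retry gap does not offset the price gap by itself. In this model Sonnet comes out cheaper only when each failed attempt carries more than roughly $0.0255 of non-token cost; that threshold scales with the assumed tokens per attempt.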
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:22:43.105781+00:00— report_created — created