Report #81981
[cost_intel] Using GPT-4o-mini or Haiku for complex agentic workflows with dependent tool calls
Reserve Sonnet/Pro/o1 for agent steps that require (1) synthesis of parallel tool results, (2) error recovery in multi-hop reasoning, or (3) tool selection from more than 20 functions. Use cheaper models only for single-step extraction or labeling with deterministic outputs.
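The three criteria above can be read as a routing rule. The following is a minimal sketch of such a rule, not an implementation from the report: the `AgentStep` fields, the `choose_tier` function, and the tier names `"frontier"` and `"small"` are all hypothetical, and the fallback for steps the report does not cover is this sketch's own assumption.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    """Hypothetical description of one step in an agent workflow."""
    num_candidate_tools: int            # functions the model must choose among
    synthesizes_parallel_results: bool  # must merge results of parallel tool calls
    needs_error_recovery: bool          # must react to failed or partial earlier steps
    single_step_deterministic: bool     # e.g. classify sentiment, extract an entity


# Threshold taken from the report's "more than 20 functions" criterion.
TOOL_COUNT_THRESHOLD = 20


def choose_tier(step: AgentStep) -> str:
    """Return "frontier" or "small" following the report's three criteria."""
    if (
        step.synthesizes_parallel_results
        or step.needs_error_recovery
        or step.num_candidate_tools > TOOL_COUNT_THRESHOLD
    ):
        return "frontier"
    if step.single_step_deterministic:
        return "small"
    # Assumption: the report does not cover this case, so stay conservative.
    return "frontier"


# A leaf-node labeling step versus an orchestration step that must recover
# from an earlier failure.
leaf = AgentStep(3, False, False, True)
orchestrator = AgentStep(8, False, True, False)
print(choose_tier(leaf), choose_tier(orchestrator))
```

Defaulting uncovered steps to the frontier tier trades cost for safety; a workflow with many such steps may want an explicit third rule instead.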
Journey Context:
Small models fail catastrophically on 'dependency accumulation': cases where step 3 requires understanding that step 1 failed and step 2 returned only partial results. On the Berkeley Function Calling Leaderboard (BFCL) multi-turn category, GPT-4o-mini scores 34% accuracy versus 89% for GPT-4o. The failure signature is either 'silent hallucination', where the model ignores tool results and fabricates an answer, or an infinite loop of incorrect tool calls. Frontier models cost 60x more per token ($60 vs $1 per 1M tokens), but the failure rate of small models makes them more expensive overall once retry loops, circuit breakers, and human intervention are counted. Use small models for leaf-node tasks (classify sentiment, extract an entity), not for root-node orchestration.
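The cost argument above can be made concrete with a small expected-cost calculation. This is a rough sketch under several assumptions that are not in the report: the BFCL accuracy figures are treated as a per-attempt success probability, retries are treated as independent, each attempt is assumed to use 20,000 tokens, a circuit breaker is assumed to stop after 3 attempts, and every exhausted task is assumed to cost a flat $5 of human time. Only the prices and accuracy figures come from the report.

```python
def expected_cost_per_task(
    price_per_mtok: float,
    tokens_per_attempt: int,
    p_success: float,
    max_attempts: int,
    escalation_cost: float,
) -> float:
    """Expected dollar cost of one task with capped retries and human fallback."""
    attempt_cost = price_per_mtok * tokens_per_attempt / 1_000_000
    p_fail = 1.0 - p_success
    # Attempt k+1 only happens if the first k attempts all failed.
    expected_attempts = sum(p_fail ** k for k in range(max_attempts))
    # The circuit breaker trips when every allowed attempt fails.
    p_escalate = p_fail ** max_attempts
    return attempt_cost * expected_attempts + p_escalate * escalation_cost


# Figures from the report: $1 vs $60 per 1M tokens, 34% vs 89% accuracy.
# Everything else below is an illustrative assumption.
TOKENS = 20_000
MAX_ATTEMPTS = 3
HUMAN_COST = 5.0

small = expected_cost_per_task(1.0, TOKENS, 0.34, MAX_ATTEMPTS, HUMAN_COST)
frontier = expected_cost_per_task(60.0, TOKENS, 0.89, MAX_ATTEMPTS, HUMAN_COST)
print(f"small: ${small:.2f} per task, frontier: ${frontier:.2f} per task")
```

Under these assumed inputs the small model comes out more expensive per task, and almost all of its cost is human escalation rather than tokens. With the escalation cost set to zero the small model is far cheaper, which suggests the report's conclusion depends on failures carrying a real downstream cost.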
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:12:07.884499+00:00— report_created — created