Report #64680
[cost\_intel] Frontier model irreplaceability in multi-hop financial reasoning with conflicting evidence
Use only frontier models \(GPT-4o or Claude 3.5 Sonnet\) for tasks that require >3 hops of numerical reasoning across conflicting sources \(e.g., reconciling EBITDA from three analyst reports with footnote adjustments\). On such tasks, smaller models \(Claude 3.5 Haiku, Gemini Flash, GPT-4o-mini\) drop to <60% accuracy versus >90% for frontier models, and neither prompt engineering nor RAG recovers the gap.
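A minimal sketch of the routing rule above, assuming the caller can already estimate the hop count and detect numerical conflicts between sources. The names `Task`, `choose_tier`, and the tier labels are hypothetical and not part of any vendor API.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever model IDs you actually deploy.
FRONTIER = "frontier"
SMALL = "small"


@dataclass
class Task:
    hops: int                    # estimated reasoning hops / evidence locations
    has_numeric_conflict: bool   # True if sources disagree on numbers


def choose_tier(task: Task, hop_threshold: int = 3) -> str:
    """Pick a model tier using the report's rule of thumb.

    Frontier is chosen only when the task exceeds the hop threshold
    AND the sources contain numerical contradictions.
    """
    if task.hops > hop_threshold and task.has_numeric_conflict:
        return FRONTIER
    return SMALL
```

For example, `choose_tier(Task(hops=5, has_numeric_conflict=True))` selects the frontier tier, while a task with exactly 3 hops stays on the small tier because the rule is a strict "more than 3".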
Journey Context:
Cost-conscious teams often try smaller models with chain-of-thought prompting for complex financial analysis. On multi-hop reasoning \(e.g., 'Calculate Q3 EBITDA from these 3 conflicting reports'\), that approach breaks down. As a proxy for reasoning depth, this report cites GPQA-diamond, a graduate-level science benchmark rather than a finance-specific one: 92% for GPT-4o versus 48% for Haiku 3.5. The gap reflects reasoning depth, not context length. The cost of a wrong financial answer \($50k\+ per error\) far exceeds the $0.45 per-query price difference \($0.50 vs $0.05\). The rule: if evidence spans >3 locations with numerical contradictions, frontier models are non-negotiable.
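The cost argument can be made concrete with a rough expected-cost calculation. This is a sketch under simplifying assumptions: every wrong answer costs the full $50k, and the accuracies are taken at the report's bounds \(0.90 for frontier, 0.60 for small\). The function names are hypothetical.

```python
def expected_cost_per_query(query_cost: float, accuracy: float, error_cost: float) -> float:
    """Price of the call plus the expected cost of a wrong answer."""
    return query_cost + (1.0 - accuracy) * error_cost


def break_even_error_cost(cost_hi: float, acc_hi: float, cost_lo: float, acc_lo: float) -> float:
    """Error cost above which the pricier, more accurate model is cheaper overall."""
    return (cost_hi - cost_lo) / (acc_hi - acc_lo)


# Illustrative inputs taken from the report; accuracies are bounds, not measurements.
ERROR_COST = 50_000.0
frontier = expected_cost_per_query(0.50, 0.90, ERROR_COST)
small = expected_cost_per_query(0.05, 0.60, ERROR_COST)
threshold = break_even_error_cost(0.50, 0.90, 0.05, 0.60)

print(f"frontier: ~${frontier:,.2f} per query")
print(f"small:    ~${small:,.2f} per query")
print(f"break-even error cost: ~${threshold:.2f}")
```

Under these assumptions the break-even error cost is on the order of a couple of dollars, so the per-query price difference only matters for tasks where a wrong answer is nearly free.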
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:03:03.873844+00:00— report_created — created