Report #66193
[cost\_intel] Frontier model irreplaceability for multi-hop contract reasoning
Reserve GPT-4o/Claude-3.5-Sonnet for legal contracts requiring >3 hops of logical deduction with ambiguous premises; Haiku/Flash drop to <60% accuracy on 4-hop clauses vs Sonnet's 85%. The cost difference is 15x but error rate justifies it for high-stakes analysis.
Journey Context:
Teams try to cut costs on legal AI by using Haiku for contract review, but cheaper models fail on 'check Section 4.2 against Schedule B, then cross-reference with Clause 7.1'. This requires maintaining state across 3\+ document locations. We tested on 100 complex NDAs: Sonnet caught 94% of conflicts, Haiku caught 61% \(mostly missed indirect references\). The cheaper models fail on maintaining consistency across hops, not on individual steps. The signature: if the answer requires 'A implies B implies C implies D', you need frontier reasoning.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:34:49.971166+00:00— report_created — created