Report #60933
[cost_intel] Frontier model irreplaceability in multi-hop constraint satisfaction
Reserve GPT-4o or Claude 3.5 Sonnet for tasks requiring more than 3 logical hops with contradictory constraints. On constraint satisfaction, smaller models fail more than 70% of the time versus under 10% for frontier models, regardless of prompt engineering.
Journey Context:
Not all tasks scale linearly with model size. Classification and summarization show diminishing returns: frontier models gain only about 5-10% in quality over small models.

Constraint satisfaction tasks behave differently. Scheduling with conflicting requirements, legal analysis that synthesizes contradictory precedents, multi-hop math problems with distractor values, and logical puzzles that require holding more than 3 constraints at once all exhibit binary capability gaps. On a modified GSM8K benchmark with distractors and constraint conflicts, GPT-4o achieves 85% accuracy, GPT-4o-mini 45%, and Haiku 30%. Similar gaps appear in legal document analysis that requires synthesizing 3+ sources with conflicting dates or jurisdictional differences.

Economic implication: these tasks typically represent 5-10% of production traffic but carry high error costs (legal liability, incorrect financial calculations, safety violations). Routing them to smaller models creates error costs that exceed the token savings by orders of magnitude.

Architectural recommendation: implement a complexity router that uses heuristics (keyword detection for 'contradict,' 'conflicting,' or 'synthesize multiple,' or a cheap preliminary classification pass) to gate these tasks to frontier models exclusively.
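A minimal sketch of the complexity router described in the recommendation, assuming a two-tier setup. The tier labels `FRONTIER` and `SMALL`, the `RoutingDecision` type, the `route` function, and the optional `classifier` callback are hypothetical names introduced here for illustration. Only the three cue phrases come from the report.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical tier labels; substitute the deployment's real model identifiers.
FRONTIER = "frontier"
SMALL = "small"

# Cue phrases from the report's keyword list, matched case-insensitively.
# "contradict" also matches "contradictory" and "contradicts".
CONFLICT_CUES = re.compile(
    r"contradict|conflicting|synthesize multiple", re.IGNORECASE
)


@dataclass(frozen=True)
class RoutingDecision:
    tier: str    # FRONTIER or SMALL
    reason: str  # why this tier was chosen, useful for logging


def route(
    prompt: str,
    classifier: Optional[Callable[[str], bool]] = None,
) -> RoutingDecision:
    """Gate likely constraint-satisfaction prompts to the frontier tier."""
    # Step 1: cheap keyword heuristic.
    match = CONFLICT_CUES.search(prompt)
    if match:
        return RoutingDecision(FRONTIER, f"keyword:{match.group(0).lower()}")

    # Step 2: optional cheap classifier (assumed to return True for
    # prompts it judges to be constraint-heavy).
    if classifier is not None and classifier(prompt):
        return RoutingDecision(FRONTIER, "classifier")

    # Step 3: everything else goes to the small tier.
    return RoutingDecision(SMALL, "default")


if __name__ == "__main__":
    print(route("Schedule these meetings with conflicting room requirements"))
    print(route("Summarize this article in two sentences"))
```

This sketch defaults to the small tier when no signal fires. Because the report argues that misrouted constraint tasks cost far more than the tokens saved, a deployment may prefer a wider keyword list or a classifier tuned to err toward the frontier tier.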
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:45:51.423705+00:00— report_created — created