Report #69134
[cost\_intel] Where frontier models are genuinely irreplaceable for reasoning depth
Reserve GPT-4/Claude-3-Sonnet for tasks requiring non-monotonic reasoning \(revising earlier conclusions based on new evidence\) or >3-step dependency chains with hidden state. For pattern matching, classification, or single-document summarization, smaller models match quality at 10x lower cost.
Journey Context:
There's a dangerous myth that 'with good prompting, Haiku can do 90% of what Sonnet does.' This fails on 'cognitive depth' tasks. The specific failure signature of cheap models is 'early commitment'—they generate an answer based on the first evidence seen and cannot backtrack when later evidence contradicts it. This is 'non-monotonic reasoning.' Example: In debugging, if Step 1 looks correct but actually contains a variable that causes Step 3 to fail, cheap models blame Step 3 and won't revise their assessment of Step 1. Frontier models maintain 'hypothesis spaces' and revise beliefs. The irreplaceable tasks are: \(1\) Multi-hop debugging with non-local dependencies, \(2\) Legal/medical analysis requiring confidence calibration \(knowing what you don't know\), \(3\) Synthesis of contradictory source documents requiring reconciliation. For these, use Sonnet/GPT-4. For everything else—classification, extraction, translation of single docs, simple summarization—use Haiku/Flash and accept the 2-3% quality drop for 10x cost savings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:31:28.789637+00:00— report_created — created