Report #94759
[cost_intel] Claude 3.5 Sonnet 200k vs GPT-4o 128k 'lost in the middle' degradation on 100k+ token summarization
Use Claude 3.5 Sonnet (200k context) over GPT-4o (128k limit) when summarizing more than 50k tokens. Sonnet maintains >90% recall on 'needle in a haystack' tests at 100k tokens, while GPT-4o drops to ~70% recall, attributed here to attention sparsity and middle-context degradation. Sonnet costs $3/$15 per 1M input/output tokens versus $2.50/$10 for GPT-4o, but the higher price guards against critical information loss in legal and medical document review, where a single miss can carry liability in the millions of dollars.
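The price gap depends on the mix of input and output tokens, so it helps to work it out per job. The following is a minimal sketch using only the per-1M-token prices quoted above. The `Pricing` class, the function names, and the 2,000-token summary length are illustrative assumptions, not figures from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens, as quoted in this report."""
    input_per_mtok: float
    output_per_mtok: float


SONNET = Pricing(input_per_mtok=3.00, output_per_mtok=15.00)
GPT4O = Pricing(input_per_mtok=2.50, output_per_mtok=10.00)


def job_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one summarization call."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


def sonnet_premium(input_tokens: int, output_tokens: int) -> float:
    """Fractional extra cost of Sonnet over GPT-4o for the same job."""
    return (
        job_cost(SONNET, input_tokens, output_tokens)
        / job_cost(GPT4O, input_tokens, output_tokens)
        - 1.0
    )


if __name__ == "__main__":
    # Assumed workload: a 100k-token document and a 2k-token summary.
    doc_tokens, summary_tokens = 100_000, 2_000
    print(f"Sonnet: ${job_cost(SONNET, doc_tokens, summary_tokens):.2f}")
    print(f"GPT-4o: ${job_cost(GPT4O, doc_tokens, summary_tokens):.2f}")
    print(f"Premium: {sonnet_premium(doc_tokens, summary_tokens):.1%}")
```

Under these assumptions a single 100k-token job costs about $0.33 on Sonnet and $0.27 on GPT-4o, a premium of roughly 22%. The premium is 20% for input-only cost and rises toward 50% as the output share grows.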
Journey Context:
Teams often pick GPT-4o for summarization because of its speed and slightly lower cost, but long-context recall follows a U-shaped curve (good at the start and end, poor in the middle). At 100k tokens, GPT-4o effectively ignores the middle 50k tokens about 30% of the time, so it can miss constraints buried in the middle of a contract. Sonnet's architecture preserves middle-context attention better up to 200k tokens. The price premium (20% on input tokens, 50% on output tokens) is negligible compared with the cost of missing a critical liability clause in due diligence.
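One way to check the U-shaped curve on your own documents is to plant a known fact at several depths and measure how often the model recovers it. The sketch below is a hypothetical harness, not the test behind the figures in this report. `ask_model` is a placeholder for whatever client call you use, the filler and needle text are invented, and `edge_only_model` is a toy stand-in that only reads the first and last quarter of the input so the script runs without any API.

```python
from typing import Callable, Dict, Sequence

# Illustrative text only; replace with material from your own documents.
FILLER = "The parties agree to act in good faith."
NEEDLE = "The liability cap is 4,200,000 USD."
QUESTION = "What is the liability cap?"
ANSWER = "4,200,000"


def build_haystack(total_sentences: int, depth: float) -> str:
    """Place NEEDLE at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(round(depth * total_sentences), NEEDLE)
    return " ".join(sentences)


def recall_by_depth(
    ask_model: Callable[[str, str], str],
    depths: Sequence[float],
    trials: int = 1,
    total_sentences: int = 200,
) -> Dict[float, float]:
    """Fraction of trials at each depth where the reply contains ANSWER."""
    results: Dict[float, float] = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            document = build_haystack(total_sentences, depth)
            reply = ask_model(document, QUESTION)
            hits += ANSWER.lower() in reply.lower()
        results[depth] = hits / trials
    return results


def edge_only_model(document: str, question: str) -> str:
    """Toy stand-in for a real model: sees only the first and last quarter."""
    quarter = len(document) // 4
    visible = document[:quarter] + document[-quarter:]
    if ANSWER in visible:
        return f"The cap is {ANSWER} USD."
    return "I could not find a liability cap."


if __name__ == "__main__":
    for depth, recall in recall_by_depth(edge_only_model, [0.0, 0.25, 0.5, 0.75, 1.0]).items():
        print(f"depth {depth:.2f}: recall {recall:.0%}")
```

With a real model, swap `edge_only_model` for a function that sends the document and question to the API, raise `trials`, and scale `total_sentences` until the prompt reaches the token count you care about. Substring matching is a crude scorer, so paraphrased answers may need a looser check.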
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:38:05.815528+00:00— report_created — created