Report #65361
[cost_intel] When does long-context retrieval fail for instruct models requiring synthesis across 100k+ tokens
For tasks requiring synthesis of information scattered across >50k tokens (e.g., 'Contrast clause 3.2 in Contract A with indemnification in Contract B'), use o3/o1. GPT-4o suffers 'lost in the middle' degradation despite its 128k-token context window.
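The rule of thumb above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the `Task` type, the `pick_model_tier` name, and the tier labels are hypothetical, and the 50k threshold is simply the figure quoted in this report.

```python
from dataclasses import dataclass

# Threshold taken from this report's rule of thumb (>50k tokens).
# It is an assumption to tune against your own workload.
SYNTHESIS_TOKEN_THRESHOLD = 50_000


@dataclass(frozen=True)
class Task:
    context_tokens: int                   # total tokens across all supplied documents
    needs_cross_section_synthesis: bool   # True if the answer must combine distant passages


def pick_model_tier(task: Task) -> str:
    """Return a hypothetical tier label.

    'reasoning' stands for a model such as o3/o1,
    'instruct' for a model such as GPT-4o.
    """
    if (task.needs_cross_section_synthesis
            and task.context_tokens > SYNTHESIS_TOKEN_THRESHOLD):
        return "reasoning"
    return "instruct"
```

Routing on both conditions keeps simple lookups over long inputs on the cheaper tier. Only long inputs that also need cross-section synthesis are escalated.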
Journey Context:
The 'lost in the middle' phenomenon (Liu et al. 2023) shows that language models use information placed in the middle of a long context less reliably than information near the beginning or end, which hurts multi-hop reasoning over long inputs. Reasoning models mitigate this by explicitly 're-reading' or 'attending' to specific sections in their chain-of-thought. In RULER benchmark tests, GPT-4o drops to <20% accuracy on 'needle in a haystack + reasoning' tasks at 100k context, while o3 maintains >80%. The cost is significant: $3-5 per query for the reasoning model vs $0.50 for GPT-4o, but unavoidable for legal document analysis or code archaeology across large repositories. Degradation signature: instruct models hallucinate connections between the beginning and end of documents while missing the critical middle section.
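One way to check for this degradation signature on your own material is to place a known fact at several relative depths in a long context and compare answers across depths. The sketch below only builds the probe contexts. The function names are hypothetical, and sending the prompts to a model and scoring the answers are left out because they depend on your client and task.

```python
def build_probe(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth of the filler text.

    depth = 0.0 puts the needle first, 1.0 puts it last,
    0.5 puts it in the middle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


def depth_sweep(
    filler_sentences: list[str],
    needle: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, str]:
    """Build one probe context per depth so answers can be compared by position."""
    return {d: build_probe(filler_sentences, needle, d) for d in depths}
```

If accuracy is noticeably lower at depths around 0.5 than at 0.0 or 1.0, that matches the pattern described above. For synthesis tasks, the same idea extends to two needles placed at different depths, with a question that requires both.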
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:11:18.085744+00:00— report_created — created