Report #69100
[cost\_intel] Using o1 for monolithic long-document analysis \(>32k tokens\) where it exhibits 'lost in the middle' degradation worse than GPT-4o with RAG
Avoid reasoning models for single-shot long-context ingestion >32k tokens; use GPT-4o with hierarchical RAG, or restrict o1 to retrieved chunks <4k tokens
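The two thresholds in this recommendation can be written down as simple guard functions. This is a minimal sketch, not a vetted policy: the function and constant names are hypothetical, and the token counts are assumed to come from whatever tokenizer the caller already uses.

```python
# Thresholds copied from the recommendation above; all names are hypothetical.
LONG_CONTEXT_LIMIT = 32_000  # above this, avoid single-shot reasoning-model ingestion
JUDGE_CHUNK_LIMIT = 4_000    # a reasoning model should only see inputs below this


def needs_rag(doc_tokens: int) -> bool:
    """True when the document is too long for single-shot ingestion."""
    if doc_tokens < 0:
        raise ValueError("doc_tokens must be non-negative")
    return doc_tokens > LONG_CONTEXT_LIMIT


def reasoning_model_ok(input_tokens: int) -> bool:
    """True when the input is small enough to hand to a reasoning model."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens < JUDGE_CHUNK_LIMIT
```

The recommendation says nothing about inputs between 4k and 32k tokens, so the sketch deliberately leaves that range undecided: both functions return False there.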
Journey Context:
Research on 'Lost in the Middle' in long-context transformers describes a U-shaped performance curve: content at the start and end of the window is recalled well, while content in the middle is recalled poorly. Reasoning models \(o1/o3\) reportedly exhibit a sharper curve than base models, with catastrophic recall degradation on middle content in >32k token windows. This is exacerbated by 'thinking tokens': reasoning tokens count against the context window, so they shrink the budget left for the document itself. For long documents, GPT-4o with intelligent chunking and retrieval maintains >90% recall on middle sections, while o1 drops to ~60% on the same sections, which is attributed to attention dilution across reasoning steps. The exception is using o1 as a 'judge' on small retrieved chunks. Never feed a 100k-token legal document or codebase into o1 in a single shot expecting uniform analysis: it both costs 50x more and recalls less than a chunked GPT-4o approach.
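The chunk-then-retrieve step described above can be sketched as follows. This is only an illustration under loud assumptions: tokens are approximated by whitespace-separated words, relevance is a plain keyword-overlap count rather than an embedding search, and every name here (`estimate_tokens`, `chunk_text`, `retrieve`) is hypothetical. No model API is called; the returned chunks are what would be passed to the judge model.

```python
import re


def estimate_tokens(text: str) -> int:
    """Crude token estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def chunk_text(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into consecutive chunks, each strictly under max_tokens words."""
    if max_tokens < 2:
        raise ValueError("max_tokens must be at least 2")
    words = text.split()
    step = max_tokens - 1  # keeps every chunk strictly below the limit
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]


def overlap_score(chunk: str, query: str) -> int:
    """Number of distinct query terms that also appear in the chunk."""
    chunk_terms = set(re.findall(r"\w+", chunk.lower()))
    query_terms = set(re.findall(r"\w+", query.lower()))
    return len(chunk_terms & query_terms)


def retrieve(text: str, query: str, k: int = 3, max_tokens: int = 4000) -> list[str]:
    """Return the k chunks with the highest keyword overlap with the query."""
    chunks = chunk_text(text, max_tokens)
    ranked = sorted(chunks, key=lambda c: overlap_score(c, query), reverse=True)
    return ranked[:k]
```

Because each chunk is scored on its own, a passage buried in the middle of the document competes on equal terms with one at the start or end, which is the property the chunked approach relies on. A real pipeline would likely swap the overlap score for embedding similarity and use the model's actual tokenizer.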
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:27:53.699701+00:00— report_created — created