Report #80439
[cost_intel] Does reasoning improve long-context recall for needle-in-haystack tasks?
Do not rely on o1 to fix 'lost in the middle' for contexts >100k tokens; use RAG with small chunks (<4k tokens) even with reasoning models. o1-preview has the same 128k context window as GPT-4o and exhibits a similar U-shaped recall curve: accuracy drops to 60% for facts at 64k-96k depth, vs 95% for RAG with chunked retrieval.
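One way to check for this U-shaped pattern on your own model is a depth sweep: place the same fact at several relative depths of a long filler context and record whether it is recalled. The following is a minimal sketch, not a benchmark implementation. `ask` is a hypothetical callable wrapping whatever model API you use, and `edge_only_reader` is a fake stand-in that only reads the start and end of the context so the example runs offline; it is not a real model and its results say nothing about o1 or GPT-4o.

```python
from typing import Callable, Dict, List


def build_haystack(filler: List[str], needle: str, depth: float) -> List[str]:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of `filler`."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = round(depth * len(filler))
    return filler[:pos] + [needle] + filler[pos:]


def recall_by_depth(
    ask: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """Return, per depth, whether the expected string appears in the answer."""
    results = {}
    for depth in depths:
        context = "\n".join(build_haystack(filler, needle, depth))
        answer = ask(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results


def edge_only_reader(context: str, question: str) -> str:
    """Fake model (assumption for the demo): sees only the first and last 20% of lines."""
    lines = context.split("\n")
    k = max(1, len(lines) // 5)
    for line in lines[:k] + lines[-k:]:
        if "passcode" in line:
            return line
    return "I don't know"


filler = [f"Filler sentence {i}." for i in range(100)]
results = recall_by_depth(
    ask=edge_only_reader,
    filler=filler,
    needle="The passcode is 7421.",
    question="What is the passcode?",
    expected="7421",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

Swapping `edge_only_reader` for a real model call and scaling the filler toward the context sizes discussed here would show whether mid-depth facts are the ones being missed.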
Journey Context:
Reasoning spends more inference compute per answer, but it does not expand the effective context window or fix attention decay. On the 'needle in a haystack' benchmark, o1-preview matches GPT-4o: perfect recall at 0-32k depth, degrading to 50-70% at 64k-128k. The cost is 10x higher for identical recall failure modes. The signature is correct answers for 'first page' and 'last page' facts but hallucinations for 'page 50 of 100'. The alternative is 'Reasoning over RAG': use cheap embedding retrieval to fetch the top-5 chunks, then have o1 synthesize over them. This keeps the context under 8k tokens, where recall is near-perfect as long as retrieval surfaces the relevant chunk.
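The 'Reasoning over RAG' flow can be sketched as follows. This is an illustration under stated assumptions, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, chunk sizes and the budget are counted in words as a rough proxy for tokens (assuming about 0.75 words per token, so 6000 words for the 8k-token cap), and `synthesize` is a hypothetical callable wrapping the reasoning-model call.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(chunks: List[str], query: str, k: int = 5) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]


def build_prompt(chunks: List[str], query: str, max_words: int = 6000) -> str:
    """Pack chunks in rank order until the word budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_words:
            break
        kept.append(chunk)
        used += n
    return "Context:\n" + "\n---\n".join(kept) + f"\n\nQuestion: {query}"


def answer(document: str, query: str, synthesize: Callable[[str], str],
           k: int = 5, chunk_size: int = 200) -> str:
    """Retrieve top-k chunks cheaply, then hand a small prompt to the reasoning model."""
    chunks = chunk_words(document, chunk_size)
    return synthesize(build_prompt(retrieve(chunks, query, k), query))


# Demo: a 100-"page" document with the fact buried in the middle.
pages = [f"Page {i} contains routine filler text about nothing." for i in range(100)]
pages.insert(50, "The vault passcode is 7421.")
document = " ".join(pages)
query = "What is the vault passcode?"

chunks = chunk_words(document, max_words=40)
top = retrieve(chunks, query, k=2)
# Identity function in place of a real model call, so the prompt can be inspected.
prompt = answer(document, query, synthesize=lambda p: p, k=2, chunk_size=40)
```

In the demo the mid-document fact lands in the top-ranked chunk, so the prompt passed to `synthesize` is a few dozen words rather than the full document. With a real embedding model the ranking quality, not the reasoning model, becomes the thing to validate.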
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:37:44.026730+00:00— report_created — created