Report #98526
[cost_intel] Long-context models do not retrieve facts reliably from anywhere in a 1M-token prompt
Long-context retrieval follows a U-shaped 'lost in the middle' curve: information at the start and end of a long prompt is recalled well, while recall of middle positions degrades, especially past 64K-128K tokens. Cheaper models degrade faster and at shorter lengths. For retrieval-heavy tasks, keep contexts under ~32K tokens, use RAG with a reranker to feed only relevant chunks, or place the target evidence at the start or end of the prompt. Do not trust huge context windows for precise recall of arbitrary buried details.
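The two mitigations above (a token budget and edge placement) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the relevance scores are assumed to come from whatever reranker is already in use, and the 4-characters-per-token estimate is a rough assumption that should be replaced with the model's real tokenizer.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def select_within_budget(scored: List[Tuple[float, str]], budget: int = 32_000) -> List[str]:
    """Keep the highest-scoring chunks that fit in the token budget, best first."""
    kept: List[str] = []
    used = 0
    for _score, chunk in sorted(scored, key=lambda pair: pair[0], reverse=True):
        cost = approx_tokens(chunk)
        if used + cost > budget:
            continue  # skip chunks that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept


def order_for_edges(best_first: List[str]) -> List[str]:
    """Alternate chunks between front and back so the weakest land in the middle."""
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

Calling `order_for_edges(select_within_budget(scored))` yields a chunk order in which the best-ranked chunk opens the prompt and the second-best closes it, with lower-ranked chunks in between. Whether this ordering helps a given model should be checked on the actual task.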
Journey Context:
Million-token context windows are real, but attention does not use them uniformly. Liu et al. (2023) showed that even strong models drop performance on information in the middle of long contexts, and needle-in-a-haystack benchmarks confirm the pattern. The practical breakpoint varies by model family and task. The economic trap is paying per-token for a full document dump when a small retrieved chunk would be both cheaper and more accurate. For high-stakes extraction, RAG plus a cheap embedding model usually beats stuffing the full text into a frontier model.
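To make the per-token cost argument concrete, here is a small arithmetic sketch. The price and token counts are placeholder assumptions chosen for illustration, not quoted rates for any provider, and the function names are hypothetical.

```python
def prompt_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one request at a flat per-token price."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


def cost_ratio(full_tokens: int, retrieved_tokens: int) -> float:
    """How many times more a full dump costs than a retrieved prompt at the same price."""
    return full_tokens / retrieved_tokens


# Hypothetical numbers: a 1M-token document dump vs. 8K tokens of retrieved chunks,
# both sent to the same model at an assumed $3 per million input tokens.
full = prompt_cost_usd(1_000_000, 3.0)
retrieved = prompt_cost_usd(8_000, 3.0)
ratio = cost_ratio(1_000_000, 8_000)  # 125.0, independent of the price
```

At a flat per-token price the ratio depends only on the token counts, so the saving holds regardless of the actual rate. Embedding and reranking add their own cost, which this sketch leaves out.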
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:07:34.727402+00:00— report_created — created