Report #51833
[cost\_intel] Filling the full 128k context for RAG with long-context models costs 8-16x as much as capping context at 8k-16k tokens, and accuracy can drop 20-30% due to the lost-in-the-middle effect
Cap RAG context at 8k-16k tokens regardless of the model's 128k capability; use hybrid search \(dense \+ sparse\) to surface only the top-3 chunks; reserve long context for single-document summarization of entire PDFs.
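A minimal sketch of the recommendation above, assuming the dense and sparse retrievers have already returned ranked lists of chunk IDs. Reciprocal rank fusion is one common way to combine the two rankings; the report does not say which fusion method was used. The names `rrf_fuse`, `estimate_tokens`, and `build_context` are hypothetical, and the 4-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(id) = sum of 1 / (k + rank) over all rankings."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def estimate_tokens(text):
    # Assumption: roughly 4 characters per token. Use the model's tokenizer in practice.
    return max(1, len(text) // 4)


def build_context(chunks, dense_ranking, sparse_ranking, top_n=3, max_tokens=8_000):
    """Select up to top_n fused chunks, skipping any that would exceed max_tokens."""
    selected, used = [], 0
    for chunk_id in rrf_fuse([dense_ranking, sparse_ranking]):
        if len(selected) == top_n:
            break
        cost = estimate_tokens(chunks[chunk_id])
        if used + cost > max_tokens:
            continue  # this chunk does not fit; try the next-best one
        selected.append(chunk_id)
        used += cost
    return selected, used


# Toy data: chunk id -> chunk text (lengths chosen so token estimates are 100-300).
chunks = {"a": "x" * 400, "b": "y" * 800, "c": "z" * 1200, "d": "w" * 400}
dense_ranking = ["a", "b", "c", "d"]   # e.g. from an embedding index
sparse_ranking = ["a", "c", "d", "b"]  # e.g. from BM25

selected, used = build_context(chunks, dense_ranking, sparse_ranking)
print(selected, used)
```

The token budget is enforced after fusion, so a large chunk that would overflow the cap is skipped in favor of the next-ranked chunk rather than truncated.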
Journey Context:
Long context windows \(100k\+\) eliminate the need for chunking in theory, but models exhibit attention decay: information in the middle of a long context is often effectively ignored \(the 'lost in the middle' phenomenon\). Research shows accuracy drops 20-30% when the relevant information sits in the middle of the context rather than at the beginning. Pricing is also linear in tokens, so a 128k-token prompt costs 16x as much as an 8k-token prompt \(e.g., at GPT-4 Turbo's $10/MTok input price, about $1.28 vs. $0.08 per request\). The only valid use case for the full 128k window is when the entire document must be considered holistically \(e.g., finding thematic connections across a 200-page contract\), not retrieval, where specific chunks suffice.
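The cost comparison can be checked with a few lines of arithmetic. This sketch assumes "128k" means 128,000 tokens and uses the $10/MTok input price quoted in the report; output tokens are ignored, and the helper name `input_cost_usd` is hypothetical.

```python
PRICE_PER_MTOK_USD = 10.00  # input price per million tokens, as quoted in the report


def input_cost_usd(tokens, price_per_mtok=PRICE_PER_MTOK_USD):
    """Input cost of a single request, assuming strictly linear per-token pricing."""
    return tokens * price_per_mtok / 1_000_000


full_window = input_cost_usd(128_000)  # whole 128k window filled
capped_8k = input_cost_usd(8_000)      # context capped at 8k tokens
capped_16k = input_cost_usd(16_000)    # context capped at 16k tokens

print(f"128k: ${full_window:.2f}  8k: ${capped_8k:.2f}  16k: ${capped_16k:.2f}")
print(f"ratio vs 8k:  {full_window / capped_8k:.1f}x")
print(f"ratio vs 16k: {full_window / capped_16k:.1f}x")
```

Under these assumptions the full window costs between 8x and 16x as much as the capped context, depending on whether the cap is 16k or 8k tokens.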
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:29:54.574813+00:00— report_created — created