Report #40872
[cost_intel] Using 128k context for document Q&A requiring 3-4 full passes due to lost-in-the-middle degradation
Implement hierarchical RAG with 512-token chunks and reranking; use long context only for final synthesis of top-5 chunks, never for full document scanning
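A minimal sketch of the retrieve, rerank, then synthesize flow recommended above. It is illustrative only: whitespace-separated words stand in for model tokens, a bag-of-words cosine score stands in for a real embedding model, and a term-coverage score stands in for a real reranker. All function names (`chunk_document`, `retrieve`, `rerank`, `build_synthesis_prompt`) and the `candidates=20` default are hypothetical, not part of the original report.

```python
import math
import re
from collections import Counter

CHUNK_TOKENS = 512  # chunk size from the report (words used as a token proxy)
TOP_K = 5           # chunks passed to the final synthesis call


def tokenize(text):
    """Crude word tokenizer; an assumption standing in for a model tokenizer."""
    return re.findall(r"\w+", text.lower())


def chunk_document(text, size=CHUNK_TOKENS):
    """Split the document into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, candidates=20):
    """Cheap first stage: rank all chunks by bag-of-words similarity."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by position
    return [i for _, i in scored[:candidates]]


def rerank(query, chunks, candidate_ids, top_k=TOP_K):
    """Second stage: placeholder reranker scoring query-term coverage."""
    q_terms = set(tokenize(query))

    def coverage(i):
        return len(q_terms & set(tokenize(chunks[i]))) / max(len(q_terms), 1)

    return sorted(candidate_ids, key=lambda i: (-coverage(i), i))[:top_k]


def build_synthesis_prompt(query, document):
    """Assemble the only long-context call: top-k chunks plus the question."""
    chunks = chunk_document(document)
    top = rerank(query, chunks, retrieve(query, chunks))
    context = "\n\n".join(f"[chunk {i}]\n{chunks[i]}" for i in sorted(top))
    return f"Answer using only the excerpts below.\n\n{context}\n\nQuestion: {query}"
```

The prompt returned by `build_synthesis_prompt` is what would be sent to the model; the full document never is. In a real deployment the two scoring functions would be replaced by an embedding index and a dedicated reranking model.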
Journey Context:
API pricing scales linearly with context length, but model accuracy does not. In long contexts, accuracy follows a U-shaped curve over the position of the relevant information: facts near the beginning or end of the prompt are recalled well, while facts in the middle are often missed (the "lost in the middle" phenomenon). Users querying large documents (100k+ tokens) often find that the model misses key facts, which forces them to re-prompt several times or resend the whole document with different instructions. The result is 3-4x the expected token cost. Retrieval-Augmented Generation (RAG) avoids this: split the document into small chunks (512-1k tokens), run a cheap embedding retrieval step, and make one final synthesis call with only the most relevant chunks. This uses <10% of the tokens with higher accuracy.
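The token arithmetic behind that comparison can be sketched as follows. The 200-token allowance for instructions and the question is an assumption, as are the function names; embedding the document once is billed separately at embedding rates and is not counted here.

```python
def full_context_tokens(doc_tokens, passes, prompt_tokens=200):
    """Input tokens when the whole document is resent on every pass."""
    return passes * (doc_tokens + prompt_tokens)


def synthesis_tokens(chunk_tokens=512, top_k=5, prompt_tokens=200):
    """Input tokens for one synthesis call over the top-k retrieved chunks."""
    return top_k * chunk_tokens + prompt_tokens


def rag_share(doc_tokens, passes, **kwargs):
    """Fraction of the full-context input tokens that the RAG call uses."""
    return synthesis_tokens(**kwargs) / full_context_tokens(doc_tokens, passes)


if __name__ == "__main__":
    # A 100k-token document scanned three times versus one top-5 synthesis call.
    print(full_context_tokens(100_000, passes=3))
    print(synthesis_tokens())
    print(f"{rag_share(100_000, passes=3):.2%}")
```

With 512-token chunks and the top 5 retained, the synthesis call carries 2,560 tokens of document text, so it stays under 10% of a 100k-token full scan even when that scan succeeds on the first pass.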
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:04:20.157366+00:00— report_created — created