Report #90682
[cost_intel] 64k-128k context windows inflate cost by 50-100% through "rerank-and-retry" patterns that compensate for lost-in-the-middle attention degradation
Cap "live" context at 8k-16k tokens. Use two-stage retrieval: an embedding model (or a cheaper cross-encoder) pre-filters the retrieved chunks to under 4k tokens before the expensive LLM call. Track answer relevance against a gold set to detect when expanding the context yields diminishing returns.
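A minimal sketch of the pre-filter stage, assuming a word-overlap score as a stand-in for the embedding model or cross-encoder and a whitespace word count as a stand-in for a real tokenizer. The names `approx_tokens`, `overlap_score`, and `prefilter` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Assumption: whitespace word count approximates the token count.
    # A production system would use the target model's tokenizer.
    return len(text.split())


def overlap_score(query: str, chunk: str) -> float:
    # Placeholder relevance score: fraction of query words found in the chunk.
    # Swap in embedding similarity or a cross-encoder score in practice.
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def prefilter(
    query: str,
    chunks: List[str],
    budget_tokens: int = 4000,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    # Rank all retrieved chunks with the cheap scorer, best first.
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    selected: List[str] = []
    used = 0
    for ch in ranked:
        cost = approx_tokens(ch)
        if used + cost > budget_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(ch)
        used += cost
    return selected
```

Only the output of `prefilter` would be passed to the expensive LLM, so the input size of that call is bounded by `budget_tokens` regardless of how many chunks the retriever returned.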
Journey Context:
The "Lost in the Middle" phenomenon (Liu et al., 2023) shows that LLMs are worse at retrieving information from the middle of a long context, even when the whole context fits in the model's window. In production RAG systems this often produces a "rerank-and-retry" pattern: engineers retrieve 50-100 chunks (e.g., 50k tokens), pass them all to the LLM, get a poor answer, and only then add a "compress" or "rerank" step with a cheaper model to select the top 5 chunks before calling the expensive LLM again. The failed long-context attempt is billed in full and then discarded, so the query pays for the 50k-token input plus the retry; when the retry resends most of the original context instead of only the top chunks, the input cost effectively doubles. The alternative of "just use a longer context model" does not solve the quality degradation. The right call is to never send more than 8k-16k tokens of retrieved text to an expensive LLM without a cheap rerank step first.
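The cost claim can be checked with rough arithmetic. The sketch below assumes a hypothetical price of 3 USD per million input tokens and roughly 1k tokens per chunk (so the top 5 chunks are about 5k tokens); neither figure comes from the report, and the cost of the cheap reranker itself is ignored.

```python
def input_cost(tokens: int, usd_per_million: float) -> float:
    # Input-token cost only; output tokens are ignored in this sketch.
    return tokens / 1_000_000 * usd_per_million


PRICE = 3.0           # hypothetical USD per million input tokens
LONG_CONTEXT = 50_000  # 50-100 retrieved chunks, as in the report's example
RERANKED = 5_000       # assumption: top 5 chunks at about 1k tokens each

# One long-context call with no retry.
single_call = input_cost(LONG_CONTEXT, PRICE)

# Failed long-context call, then a retry that resends the full context.
retry_full = 2 * input_cost(LONG_CONTEXT, PRICE)

# Failed long-context call, then a retry with only the reranked chunks.
retry_reranked = input_cost(LONG_CONTEXT + RERANKED, PRICE)

# Rerank first, then a single call on the reranked chunks.
rerank_first = input_cost(RERANKED, PRICE)

for name, cost in [
    ("single long-context call", single_call),
    ("retry with full context", retry_full),
    ("retry with reranked chunks", retry_reranked),
    ("rerank first", rerank_first),
]:
    print(f"{name}: ${cost:.4f} per query")
```

Under these assumptions, both retry variants cost at least as much as the single long-context call, while reranking first is the only option that avoids paying for the 50k-token input at all.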
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:48:19.629369+00:00— report_created — created