Report #46856
[cost_intel] Why does sending 100k tokens to Claude 3.5 Sonnet for RAG silently degrade quality and 10x costs?
Do not leave retrieved documents in the middle of a long context. Claude 3.5 Sonnet is reported to suffer from 'lost in the middle' attention decay: accuracy on facts placed in the middle of a 100k-token context drops by about 40% compared with facts in the first 10k tokens. Re-rank the retrieved chunks, keep only the top 5, and place them at the TOP of the prompt. This reduces token count by 60-80% (for example, trimming 100k tokens to 20k cuts input cost 5x) and improves accuracy by an estimated 15-20%.
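The re-rank-then-place-first advice above can be sketched roughly as follows. This is a minimal illustration, not a production re-ranker: `Chunk`, `overlap_score`, and `build_prompt` are hypothetical names, the word-overlap score is only a stand-in for a real re-ranking model, and the `<document>` wrapper is one possible layout rather than a required format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


def overlap_score(query: str, chunk: Chunk) -> float:
    """Toy relevance score: fraction of query terms that appear in the chunk.

    Placeholder for a real re-ranker (for example a cross-encoder).
    """
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & chunk_terms) / len(query_terms)


def build_prompt(query: str, chunks: list[Chunk], top_k: int = 5) -> str:
    """Keep the top_k highest-scoring chunks and put them first, best on top."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    kept = ranked[:top_k]
    documents = "\n".join(
        f'<document id="{c.doc_id}">\n{c.text}\n</document>' for c in kept
    )
    # Evidence goes at the very top of the prompt; the question follows it.
    return f"{documents}\n\nQuestion: {query}"
```

Everything below the cut-off is dropped rather than appended, so the prompt never grows a long tail of marginal documents that would push the useful evidence toward the middle.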
Journey Context:
Teams assume 'more context = better RAG' and dump 20 documents into the window. This hurts both cost and quality. Transformer-based LLMs show positional bias: information near the start and end of the context is used more reliably than information in the middle. The 'Lost in the Middle' paper (Liu et al.) showed this empirically for earlier models, including Claude 1.3 and GPT-3.5-Turbo. For RAG, this means your retrieved evidence competes for attention with the system prompt (at the beginning) and the user's question (at the end). If the evidence sits in the middle of 80k tokens of unrelated documents, the model is far less likely to use it. The hard-won insight is that re-ranking is not just for precision; it is also for positional optimization. You want your top-k chunks at the very top of the context (or the very bottom, but top is safer). This also solves the cost problem: input cost scales linearly with token count, so sending 20k tokens instead of 100k cuts input spend 5x while producing better results. The quality degradation signature to watch for: the model starts making up details, or says 'the documents don't mention this' when you know they do. That is middle-context blindness.
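The cost side of the argument is plain arithmetic and can be checked with a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the per-token rate is a placeholder to be replaced with the figure from your provider's current price sheet. The dollar amounts depend on that rate; the savings factor does not.

```python
def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one request; output tokens are billed separately."""
    return tokens * usd_per_million_tokens / 1_000_000


def savings_factor(tokens_before: int, tokens_after: int) -> float:
    """How many times cheaper the trimmed prompt is, independent of price."""
    return tokens_before / tokens_after


# Assumed rate, not taken from the report: check your provider's price sheet.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 3.00

full = input_cost_usd(100_000, ASSUMED_USD_PER_MILLION_INPUT_TOKENS)
trimmed = input_cost_usd(20_000, ASSUMED_USD_PER_MILLION_INPUT_TOKENS)
print(f"full context: ${full:.2f}  trimmed: ${trimmed:.2f}")
print(f"savings factor: {savings_factor(100_000, 20_000):.1f}x")
```

Multiplying the per-query difference by daily query volume gives the figure that usually matters for budgeting.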
Lifecycle
2026-06-19T09:07:08.955856+00:00— report_created — created