Report #50667
[cost_intel] Stuffing full documents into the context window: paying more and getting worse results
Retrieve aggressively and trim the context to only the relevant passages. Use RAG with top-k retrieval into a smaller context window instead of stuffing in entire documents. You pay for every input token on every request, and excess context degrades model recall through the lost-in-the-middle effect.
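The retrieve-and-trim step can be sketched roughly as follows. This is a minimal illustration, not a production retriever: it assumes bag-of-words cosine similarity as a stand-in for embedding similarity, and a word count as a stand-in for a real tokenizer. The names `tokenize`, `cosine`, and `select_context`, and the defaults `k=5` and `max_tokens=8000`, are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a crude stand-in for a real model tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_context(query: str, chunks: list[str], k: int = 5, max_tokens: int = 8000) -> list[str]:
    """Return up to k of the highest-scoring chunks that fit within max_tokens."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)

    selected: list[str] = []
    used = 0
    for similarity, chunk in ranked[:k]:
        cost = len(tokenize(chunk))  # approximate token count (assumption)
        # Skip chunks with no overlap at all, and chunks that would blow the budget.
        if similarity <= 0.0 or used + cost > max_tokens:
            continue
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "Refund policy: refunds are issued within 30 days.",
    "Shipping takes five business days.",
    "Our office is closed on holidays.",
]
context = select_context("How long do refunds take?", chunks)
```

In a real pipeline the scoring function would typically be replaced by an embedding or hybrid retriever; the point of the sketch is the shape of the step: rank, cap at k, and enforce a token budget before anything reaches the model.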
Journey Context:
A 128K context window filled with documents costs roughly $0.50-2.00 per request in input tokens at frontier model rates (at the Opus rate of $15/M input, 128K tokens is about $1.92). If only 3K tokens are actually relevant to the query, you are paying roughly 40x more than necessary AND getting worse results. The Lost in the Middle effect (Liu et al., 2023) demonstrates that model performance degrades significantly when relevant information sits in the middle of a long context: models achieve ~80% recall for information at the start or end of the context but only ~50-60% for information in the middle. Answer quality as a function of context size is an inverted U: too little context = bad answers, optimal context (tight RAG) = best answers at lowest cost, excessive context = worst of both worlds (expensive AND lower quality). For RAG pipelines, top-5 chunk retrieval into a 5-10K token context window consistently outperforms stuffing 100K+ tokens on both cost and quality.
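The per-request arithmetic above can be checked with a few lines. This is a sketch that counts input tokens only and assumes the $15/M input rate quoted in the report; the function name `input_cost_usd` is hypothetical, and actual pricing should be checked against the provider's current rate card.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-token cost of one request, ignoring output tokens and caching."""
    return tokens * usd_per_million / 1_000_000


PRICE_PER_MILLION = 15.0  # assumed input rate in USD, taken from the report

stuffed = input_cost_usd(128_000, PRICE_PER_MILLION)  # full 128K window
trimmed = input_cost_usd(3_000, PRICE_PER_MILLION)    # only the relevant 3K tokens
ratio = stuffed / trimmed

print(f"stuffed: ${stuffed:.2f}")   # stuffed: $1.92
print(f"trimmed: ${trimmed:.3f}")   # trimmed: $0.045
print(f"overspend: {ratio:.1f}x")   # overspend: 42.7x
```

Because the cost is linear in input tokens, the overspend factor is just the ratio of token counts (128K / 3K), independent of the per-token price.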
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:31:43.383841+00:00— report_created — created