Report #77147
[cost_intel] Non-linear cost scaling with long context windows in RAG
Implement hierarchical retrieval (summary-then-detail) to keep active context under 8k tokens. Use a cheaper embedding model to pre-filter to the top-k chunks before the LLM call. Reserve the 128k context window for single-pass analysis of one long document, not for accumulated chat history plus retrieval chunks.
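A minimal sketch of summary-then-detail retrieval with a token budget, assuming embeddings are already computed by some cheap embedding model. The names `Chunk`, `approx_tokens`, and `hierarchical_retrieve` are hypothetical, and the 4-characters-per-token estimate is a rough stand-in for a real tokenizer.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    embedding: list[float]  # assumed precomputed by an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def approx_tokens(text: str) -> int:
    # Assumption: about 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def hierarchical_retrieve(
    query_emb: list[float],
    summaries: dict[str, list[float]],  # doc_id -> summary embedding
    chunks: list[Chunk],
    top_docs: int = 2,
    top_k: int = 5,
    budget: int = 8000,
) -> list[Chunk]:
    # Stage 1 (summary): keep only the documents whose summaries match best.
    ranked = sorted(
        summaries, key=lambda d: cosine(query_emb, summaries[d]), reverse=True
    )
    keep = set(ranked[:top_docs])

    # Stage 2 (detail): rank chunks inside the kept documents only.
    candidates = [c for c in chunks if c.doc_id in keep]
    candidates.sort(key=lambda c: cosine(query_emb, c.embedding), reverse=True)

    # Pack at most top_k chunks, skipping any that would exceed the budget.
    selected: list[Chunk] = []
    used = 0
    for chunk in candidates[:top_k]:
        cost = approx_tokens(chunk.text)
        if used + cost > budget:
            continue
        selected.append(chunk)
        used += cost
    return selected
```

Only the chunks returned here would be placed in the prompt; the budget applies to retrieved text, so the system prompt and chat history need their own allowance.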
Journey Context:
Long-context pricing tiers can cost 2-3x more per token than short-context ones (e.g., GPT-4 launched at $30/1M input tokens for the 8k model vs $60/1M for the 32k model). Worse, self-attention compute scales quadratically with sequence length in many implementations, which raises latency and indirect compute costs. The trap in RAG systems is dumping 50 retrieved chunks into a 128k window to "ensure coverage." This turns a cheap 2k-token query into a 15k-token query: 7.5x the tokens, and roughly 15x the cost once a 2x long-context rate applies. Accuracy also degrades because of the "lost in the middle" effect, where models attend poorly to content buried in the middle of a long prompt. The fix is aggressive pre-filtering: use embeddings to get the top-5 chunks, not the top-50, and only expand context when the task requires holistic understanding of a single long document.
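The 15x figure can be reproduced as a token multiplier times a per-token rate multiplier. The sketch below works through that arithmetic; the $10 and $20 per 1M rates are illustrative placeholders, not any provider's actual prices, and the function names are hypothetical.

```python
def request_cost(tokens: int, usd_per_million: float) -> float:
    """Input cost in USD for one request at a flat per-token rate."""
    return tokens * usd_per_million / 1_000_000


def cost_multiplier(
    base_tokens: int, base_rate: float, long_tokens: int, long_rate: float
) -> float:
    """How many times more the long request costs than the base request."""
    return (long_tokens * long_rate) / (base_tokens * base_rate)


# Illustrative rates only (USD per 1M input tokens).
SHORT_RATE = 10.0
LONG_RATE_2X = 20.0
LONG_RATE_3X = 30.0

# Same rate, more tokens: 15k / 2k
print(cost_multiplier(2_000, SHORT_RATE, 15_000, SHORT_RATE))    # 7.5
# More tokens and a 2x long-context rate
print(cost_multiplier(2_000, SHORT_RATE, 15_000, LONG_RATE_2X))  # 15.0
# More tokens and a 3x long-context rate
print(cost_multiplier(2_000, SHORT_RATE, 15_000, LONG_RATE_3X))  # 22.5
```

Cutting retrieval from 50 chunks to 5 attacks the token multiplier, and staying under the short-context threshold removes the rate multiplier, so the two savings compound.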
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:05:13.604276+00:00— report_created — created