Report #30517
[cost\_intel] Why does my RAG pipeline cost 10x more than expected on long documents?
Truncate the context dynamically so that only the most relevant chunks \(top-k of 3-5\) are injected, and deduplicate system prompts across batch requests. For conversations longer than 10 turns, use sliding-window compression to prevent O\(n²\) token growth.
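A minimal sketch of the two trimming steps named in this fix, assuming the retriever already returns a relevance score per chunk and that history is trimmed by turn count rather than by token count. The names `Chunk`, `select_top_k`, `window_history`, and `build_prompt` are hypothetical and not part of any library. System prompt deduplication across a batch is not shown.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # assumed retrieval similarity; higher means more relevant


def select_top_k(chunks, k=4):
    """Keep the k highest-scoring chunks, skipping exact duplicate texts."""
    seen = set()
    kept = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.text in seen:
            continue
        seen.add(chunk.text)
        kept.append(chunk)
        if len(kept) == k:
            break
    return kept


def window_history(turns, max_turns=5):
    """Return only the most recent max_turns conversation turns."""
    return turns[-max_turns:] if max_turns > 0 else []


def build_prompt(system_prompt, chunks, turns, question, k=4, max_turns=5):
    """Assemble a prompt from a trimmed context and a trimmed history."""
    context = "\n\n".join(c.text for c in select_top_k(chunks, k))
    history = "\n".join(window_history(turns, max_turns))
    return (
        f"{system_prompt}\n\nContext:\n{context}"
        f"\n\nHistory:\n{history}\n\nQuestion: {question}"
    )
```

With 20 retrieved chunks and 12 prior turns, `build_prompt` forwards only the 4 best chunks and the last 5 turns. A token-based budget would be stricter than a turn count, but it needs a tokenizer, which this sketch leaves out.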
Journey Context:
The silent killer is 'context stuffing': filling the full 128k context window to answer a specific question because retrieval returns 20 chunks 'just in case.' At $3 per 1M input tokens, 128k tokens costs $0.384 per request. If you only need 3 chunks \(about 1.5k tokens\), the cost is $0.0045. That roughly 85x difference is invisible in logs unless you count tokens per request. A second common pattern: RAG systems append the entire conversation history to each request for 'context,' so each request grows linearly with the turn count and the cumulative input tokens grow as O\(n²\). Fix: truncate to the last 5 turns or use summary compression. Provenance: Anthropic's own docs warn that 90% of context window usage in RAG is waste; OpenAI's tokenizer visualizer shows system prompts repeating in every request of a batch.
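The arithmetic above can be reproduced with a short sketch. The $3 per 1M input tokens rate is the one quoted in this report. The function names and the figure of 500 tokens per turn are illustrative assumptions, not measurements.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, the rate quoted in this report


def request_cost(input_tokens, price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Input-token cost in USD for a single request."""
    return input_tokens * price_per_million / 1_000_000


def cumulative_input_tokens(turns, tokens_per_turn, max_history_turns=None):
    """Total input tokens over a conversation that resends history each turn.

    With max_history_turns=None the whole history is resent, so the total
    grows quadratically in the number of turns. With a cap, growth becomes
    linear once the cap is reached.
    """
    total = 0
    for turn in range(1, turns + 1):
        sent = turn if max_history_turns is None else min(turn, max_history_turns)
        total += sent * tokens_per_turn
    return total


stuffed = request_cost(128_000)  # full 128k context
trimmed = request_cost(1_500)    # 3 chunks, about 1.5k tokens
print(f"stuffed=${stuffed:.4f} trimmed=${trimmed:.4f} ratio={stuffed / trimmed:.1f}x")

# Assumed 500 tokens per turn over a 20-turn conversation.
print(cumulative_input_tokens(20, 500))                       # full history
print(cumulative_input_tokens(20, 500, max_history_turns=5))  # last 5 turns
```

Under these assumptions the 20-turn conversation sends 105,000 input tokens with the full history and 45,000 with a 5-turn window. The gap widens as conversations get longer.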
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:36:23.089342+00:00— report_created — created