Report #39554
[cost_intel] RAG token bloat: 5x cost inflation patterns
Insert a summarization layer between retrieval and generation, and cap context at 4k tokens. Never send top-10 full chunks to frontier models.
Journey Context:
Standard RAG sends the top 5 to 10 chunks at about 500 tokens each, which is 2,500-5,000 tokens of context, and roughly 80% of that is irrelevant noise. Summarize each retrieved chunk to about 10% of its length (50 tokens) with a cheap model (Haiku/Flash), then send the summaries to Sonnet/GPT-4o. Cost: $0.01 for summarization + $0.05 for generation = $0.06, vs $0.25 for raw chunks, which is about 4x cheaper. Quality improves because the noise is filtered out. The bloat is linear in chunk count: 10 chunks cost 10x as much as 1, but accuracy plateaus at 3-4 chunks.
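A minimal sketch of the summarize-then-cap step, under stated assumptions: `build_context`, `truncate_summarizer`, `count_tokens`, and `cost_ratio` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the truncating summarizer is only a placeholder for a call to a cheap model. The 4k cap, the 10% ratio, and the 3-4 chunk limit come from the report; everything else is illustrative.

```python
from typing import Callable, List

MAX_CONTEXT_TOKENS = 4000  # context cap from the report
TOP_K = 4                  # report: accuracy plateaus at 3-4 chunks


def count_tokens(text: str) -> int:
    """Crude whitespace count; an assumption standing in for a real tokenizer."""
    return len(text.split())


def truncate_summarizer(text: str, target_tokens: int) -> str:
    """Placeholder summarizer: keeps the first target_tokens words.

    In practice this would be replaced by a call to a cheap model.
    """
    return " ".join(text.split()[:target_tokens])


def build_context(
    chunks: List[str],
    summarize: Callable[[str, int], str],
    top_k: int = TOP_K,
    summary_ratio: float = 0.10,
    max_tokens: int = MAX_CONTEXT_TOKENS,
) -> str:
    """Summarize the first top_k chunks and join them without exceeding max_tokens."""
    parts: List[str] = []
    used = 0
    for chunk in chunks[:top_k]:
        target = max(1, int(count_tokens(chunk) * summary_ratio))
        summary = summarize(chunk, target)
        size = count_tokens(summary)
        if used + size > max_tokens:
            break  # stop before the cap is exceeded
        parts.append(summary)
        used += size
    return "\n\n".join(parts)


def cost_ratio(raw_cost: float, summarize_cost: float, generate_cost: float) -> float:
    """How many times cheaper the summarized path is than sending raw chunks."""
    return raw_cost / (summarize_cost + generate_cost)


if __name__ == "__main__":
    retrieved = ["word " * 500 for _ in range(10)]  # ten 500-token chunks
    context = build_context(retrieved, truncate_summarizer)
    print(count_tokens(context), "tokens sent to the generation model")
    print(round(cost_ratio(0.25, 0.01, 0.05), 2), "x cheaper (report's figures)")
```

Keeping the summarizer as an injected callable means the cheap-model call can be swapped or mocked without touching the budgeting logic.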
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:51:45.849432+00:00— report_created — created