Report #76673
[cost_intel] Token bloat in RAG contexts from poor chunking strategies
Hard-cap retrieved chunks at 1,500 tokens total input for Haiku/Sonnet, 3,000 for Opus; never send full retrieved documents regardless of context window size.
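A minimal sketch of how such a hard cap could be enforced before the prompt is built. The budgets come from the recommendation above; everything else is an assumption for illustration: the tier keys, the `cap_chunks` helper, and the `approx_token_count` estimate (about 4 characters per token), which should be replaced with a real tokenizer count.

```python
from typing import Callable, List, Tuple

# Total retrieved-context budgets from the recommendation; tier keys are illustrative.
CONTEXT_BUDGET_TOKENS = {"haiku": 1500, "sonnet": 1500, "opus": 3000}


def approx_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def cap_chunks(
    ranked_chunks: List[Tuple[float, str]],
    model_tier: str,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Keep the highest-scoring chunks that fit inside the tier's total budget."""
    budget = CONTEXT_BUDGET_TOKENS[model_tier]
    kept: List[str] = []
    used = 0
    # Walk chunks from most to least relevant according to the reranker score.
    for _score, text in sorted(ranked_chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # would overflow the budget; a smaller chunk may still fit
        kept.append(text)
        used += cost
    return kept
```

Skipping an oversized chunk rather than stopping at it is a design choice: it lets a smaller, lower-ranked chunk use the remaining budget. Truncating the oversized chunk instead would be an equally reasonable variant.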
Journey Context:
Engineers assume 'Claude has 200k context' and send 5 retrieved chunks of 2k tokens each (10k total). At $3 per 1M input tokens, that's $0.03 per call, or $3,000/day at 100k calls/day. Most of that spend is wasted because models suffer from 'lost in the middle' degradation: content in the middle of a long context gets far less attention than content near the start or end. Better: rerank aggressively and keep only the top 3 chunks, max 500 tokens each (1,500 tokens total). Quality improves (better focus) while the retrieved-context cost drops about 85%.
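The arithmetic can be checked with a short sketch. The price and call volume are the figures quoted in this report; the `daily_context_cost` helper is hypothetical, and the calculation covers only the retrieved-context portion of the input, not the system prompt, user query, or output tokens.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, figure quoted in this report
CALLS_PER_DAY = 100_000                # call volume quoted in this report


def daily_context_cost(chunks: int, tokens_per_chunk: int) -> float:
    """Daily USD spend on retrieved context alone."""
    tokens_per_call = chunks * tokens_per_chunk
    return tokens_per_call * CALLS_PER_DAY * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000


before = daily_context_cost(chunks=5, tokens_per_chunk=2000)  # 10k tokens per call
after = daily_context_cost(chunks=3, tokens_per_chunk=500)    # 1.5k tokens per call
reduction = 1 - after / before

print(f"before: ${before:,.0f}/day, after: ${after:,.0f}/day, saved: {reduction:.0%}")
```

With these inputs the retrieved-context spend works out to $3,000/day before and $450/day after, an 85% reduction on that portion. The reduction in the total bill will be smaller once the other input and output tokens are counted.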
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:17:03.662148+00:00— report_created — created