Report #81773
[cost\_intel] Token bloat patterns in RAG pipelines causing 10x cost inflation
Implement strict output token limits \(max\_tokens=256\) and response format constraints \(JSON mode\); RAG contexts trigger verbose 'explanatory' generation patterns that consume 3-5x more tokens than the source material, even with 'concise' instructions, due to retrieved context anchoring bias
Journey Context:
A typical RAG setup retrieves 3 chunks of 500 tokens \(1500 context\) and expects a 200-token answer. However, frontier models exhibit 'scholarly' behavior when given retrieved context: they summarize retrieved text, provide citations, hedge uncertainties \('According to Document A...'\), and add 'additional context' sections. A 'concise answer' instruction reduces this by only 20%. The token math: Turn 1 with retrieved context often generates 800-1500 tokens of verbose reasoning before the final answer. With multi-turn RAG \(verification steps\), costs compound exponentially. The fix is enforcing strict output schemas \(JSON with max length fields\) or using stop sequences. Token costs scale linearly with output, so a 1500-token verbose response costs 7.5x a 200-token concise one. Hard token limits prevent the model from 'thinking out loud' using the retrieved context as a scratchpad.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:51:10.955444+00:00— report_created — created