Report #37777
[cost_intel] Token bloat patterns that silently 10x costs in RAG and agent loops
Top causes of token bloat: (1) XML/JSON wrappers repeated in every message (for example, repeated assistant tags), (2) ReAct loops that pass the full conversation history plus the retrieved docs on every step, (3) Base64-encoded images pasted into text prompts (4x or more token inflation vs. a vision API), (4) pretty-printed JSON with whitespace (roughly 30% overhead), (5) the system prompt repeated in every turn of a multi-turn conversation instead of being set once through the native system role. Fix: use native vision APIs for images, minify JSON, summarize the conversation after 3 turns, and keep the system message separate from the turns. Implement token accounting per step in agent loops so that growth is visible.
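The two cheapest fixes above, minified JSON and per-step token accounting, can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `estimate_tokens`, `StepTokenLedger`, and the sample `tool_result` payload are hypothetical names, and the 4-characters-per-token heuristic is an assumption that should be replaced with the provider's tokenizer or the usage counts returned by the API.

```python
import json


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. This is a crude heuristic,
    # not a real tokenizer; use provider usage data for actual billing.
    return max(1, len(text) // 4)


class StepTokenLedger:
    """Hypothetical helper: records estimated input tokens per agent step."""

    def __init__(self):
        self.steps = []  # list of (label, estimated_tokens)

    def record(self, label: str, prompt: str) -> int:
        tokens = estimate_tokens(prompt)
        self.steps.append((label, tokens))
        return tokens

    def total(self) -> int:
        return sum(tokens for _, tokens in self.steps)


def minify(payload: dict) -> str:
    # Compact separators drop the spaces json.dumps inserts by default.
    return json.dumps(payload, separators=(",", ":"))


# Illustrative payload, standing in for a tool result or retrieved record.
tool_result = {
    "id": 17,
    "title": "Refund policy",
    "tags": ["billing", "returns"],
    "body": "Refunds are issued within 14 days.",
}

pretty = json.dumps(tool_result, indent=4)
compact = minify(tool_result)

ledger = StepTokenLedger()
for step in range(1, 4):
    # In a real loop the prompt would also include history and instructions.
    ledger.record(f"step {step}", compact)

for label, tokens in ledger.steps:
    print(label, tokens)
print("total estimated tokens:", ledger.total())
print("pretty chars:", len(pretty), "compact chars:", len(compact))
```

Logging the ledger after each step makes it obvious when a loop resends the same context repeatedly, which is the pattern that otherwise goes unnoticed.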
Journey Context:
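Costs spiral unnoticed because 'it works.' Example: a RAG agent retrieves 5 documents x 500 tokens = 2.5k tokens of context. A ReAct loop that resends this context on each of 5 steps passes 12.5k input tokens per final answer. With Claude 3.5 Sonnet at $3/1M input tokens, that is $0.0375 per query, or $3,750/day at 100k queries/day. Optimization: summarize retrieved docs to 100 tokens each (500 total) and truncate history to the last turn, for 1.5k tokens total. Cost: $0.0045/query, $450/day, an 88% saving. The worst offender is Base64 images: a 1 MB image becomes about 1.33M Base64 characters. Counting each character as one token gives the upper-bound figure of ~1.3M tokens ($3.90 at $3/1M input, the figure quoted for GPT-4o); the real token count depends on the tokenizer and is likely lower, but it is still large enough to exceed many models' context windows. Sending the same image through a vision API directly costs on the order of $0.005.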
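The arithmetic above can be reproduced with a short script. It is a sketch under the report's own figures ($3 per 1M input tokens, 100k queries/day, 12.5k vs. 1.5k tokens per query); the function names are illustrative. The Base64 helper counts characters, not tokens, because the token count depends on the tokenizer.

```python
import base64

PRICE_PER_MILLION_INPUT_USD = 3.00  # input price quoted in the report
QUERIES_PER_DAY = 100_000           # daily volume quoted in the report


def cost_per_query(input_tokens: int,
                   price_per_million: float = PRICE_PER_MILLION_INPUT_USD) -> float:
    # Input cost in USD for one query.
    return input_tokens * price_per_million / 1_000_000


def base64_chars(raw_bytes: int) -> int:
    # Base64 emits 4 characters for every 3 input bytes, padded to a multiple of 4.
    return 4 * ((raw_bytes + 2) // 3)


baseline_tokens = 5 * 500 * 5   # 5 docs x 500 tokens, resent on 5 steps = 12,500
optimized_tokens = 1_500        # summarized docs plus last turn, per the report

baseline_cost = cost_per_query(baseline_tokens)
optimized_cost = cost_per_query(optimized_tokens)
savings = 1 - optimized_cost / baseline_cost

print(f"baseline:  ${baseline_cost:.4f}/query, ${baseline_cost * QUERIES_PER_DAY:,.0f}/day")
print(f"optimized: ${optimized_cost:.4f}/query, ${optimized_cost * QUERIES_PER_DAY:,.0f}/day")
print(f"savings:   {savings:.0%}")

# Sanity check of the size formula against the standard library on a small input.
assert len(base64.b64encode(bytes(10))) == base64_chars(10)
print("Base64 characters for a 1 MB image:", base64_chars(1_000_000))
```

Swapping in a different price or step count shows how quickly the per-day figure moves, which is the point of doing the accounting before traffic scales.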
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:53:01.307901+00:00— report_created — created