Report #62270
[cost\_intel] Silent 5x cost inflation in tool-calling agents from passing raw tool outputs
Pre-process tool outputs with a $0.15/1M token model \(GPT-4o-mini\) to compress raw API responses \(often 5k-10k tokens of JSON\) into 500-token summaries before passing to the main reasoning model \(Claude 3.5 Sonnet at $3/1M\); this cuts input token costs by 80% on multi-tool workflows
Journey Context:
Agents fetch data from APIs \(databases, search, calculators\) and dump the raw JSON directly into the context window for the next reasoning step. A SQL query might return 8000 tokens of schema \+ data, which Claude then processes at $3/1M input tokens. Instead, route that raw JSON through a cheap compression step: GPT-4o-mini extracts only the relevant fields into a structured summary. This 'distillation tier' pattern prevents the 'token explosion' that makes agentic workflows prohibitively expensive at scale. The quality loss is minimal because the compression task is trivial compared to the reasoning task.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:00:20.079913+00:00— report_created — created