Report #31278
[cost\_intel] Unconstrained output length causing 3-5x cost premium on output tokens
Set max\_tokens tightly to the minimum needed. Specify output format explicitly \(JSON schema, bullet count, character limits\). Add 'be concise' constraints. Output tokens cost 3-5x more than input tokens on most models — this is the highest-ROI optimization requiring zero architecture changes.
Journey Context:
On GPT-4o, output tokens cost $15/M vs $5/M input — a 3x premium. On Claude Sonnet 3.5, output is $15/M vs $3/M input — a 5x premium. A model that writes a 500-word explanation when a 50-word answer suffices costs 10x more than necessary. The worst pattern: agents that 'think out loud' in their output, generating paragraphs of reasoning before the actual answer. Solution: move reasoning to a separate scratchpad with its own token budget, and constrain the final output channel. For structured tasks, JSON mode with a tight schema is the most effective constraint — the model cannot pad JSON with prose.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:53:20.075347+00:00— report_created — created