Report #36167
[cost_intel] Ignoring output token cost dominance in generation-heavy workloads
For tasks that produce long outputs, such as code generation, report writing, and documentation, optimize output length aggressively. Output tokens cost 3-5x more than input tokens on most providers, so output dominates the bill far more than most teams realize.
Journey Context:
Most cost optimization focuses on input tokens via caching and compression, but for generation-heavy tasks, output tokens dominate the bill. On Claude 3.5 Sonnet, output tokens are 5x the price of input tokens; on GPT-4o, 4x. A coding agent that generates 1500 output tokens per request spends roughly 70-80% of its token cost on output when each request carries about 2000-3000 input tokens at a 5x price ratio; the share falls as the prompt grows. Mitigations ranked by impact: (1) prompt for concise outputs with explicit length constraints and style directives such as "no commentary, code only"; (2) use smaller models for generation when quality permits (Claude 3 Haiku output tokens are roughly 12x cheaper than Claude 3.5 Sonnet output tokens); (3) generate outlines or plans on frontier models, then fill in details on smaller models in a cascade. The silent cost trap is verbose models that add unnecessary explanations, restated context, or boilerplate commentary to their output, all of which you pay for at the premium output-token rate.
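The output-share arithmetic above can be checked with a small calculator. This is a minimal sketch, not a billing tool: the function names are hypothetical, the 2500-token prompt size is an illustrative assumption, and the per-million-token prices are assumed list prices that should be verified against the provider's current pricing page.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Return (input_cost, output_cost) in USD for one request.

    Prices are expressed in USD per million tokens.
    """
    input_cost = input_tokens * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost, output_cost


def output_share(input_tokens, output_tokens, input_price, output_price):
    """Return the fraction of a request's token cost spent on output."""
    input_cost, output_cost = request_cost(
        input_tokens, output_tokens, input_price, output_price
    )
    return output_cost / (input_cost + output_cost)


# ASSUMPTION: list prices in USD per million tokens (input, output).
# Check these against current provider pricing before relying on them.
SONNET_PRICES = (3.00, 15.00)  # 5x output/input ratio
HAIKU_PRICES = (0.25, 1.25)    # output roughly 12x cheaper than Sonnet

# ASSUMPTION: a 2500-token prompt; the source only fixes 1500 output tokens.
baseline = output_share(2500, 1500, *SONNET_PRICES)
trimmed = output_share(2500, 900, *SONNET_PRICES)

print(f"Output share at 1500 output tokens: {baseline:.0%}")
print(f"Output share at 900 output tokens:  {trimmed:.0%}")

# Per-request cost of the same workload on the smaller model.
sonnet_total = sum(request_cost(2500, 1500, *SONNET_PRICES))
haiku_total = sum(request_cost(2500, 1500, *HAIKU_PRICES))
print(f"Sonnet: ${sonnet_total:.4f}  Haiku: ${haiku_total:.4f}")
```

Under these assumed numbers the baseline request spends about three quarters of its token cost on output, which is why trimming output length or moving generation to a smaller model changes the bill more than shaving the prompt does.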
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:11:14.551445+00:00— report_created — created