Report #48122
[cost\_intel] Optimizing only input token costs while output tokens dominate total spend on generation tasks
For content generation tasks \(summarization, writing, translation, code generation\), an output token costs 4-5x as much as an input token across major providers. Set tight max\_tokens limits, prefer structured formats \(bullets, JSON\) over prose, and use extractive approaches where possible. A summarization task with 2K input tokens and 500 output tokens on Sonnet spends 56% of its cost on output, even though output is only 20% of total tokens.
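The 56% figure can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the per-million-token prices are assumed values chosen to match the 5x Sonnet ratio cited in this report, and `cost_breakdown` is a hypothetical helper, not part of any provider SDK. Check current pricing before relying on the numbers.

```python
# Assumed per-million-token prices in USD. These are illustrative values
# consistent with a 5x output/input ratio; verify against current pricing.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def cost_breakdown(input_tokens: int, output_tokens: int) -> dict:
    """Return total request cost plus the output share of tokens and of cost."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    total_cost = input_cost + output_cost
    return {
        "total_cost_usd": total_cost,
        "output_token_share": output_tokens / (input_tokens + output_tokens),
        "output_cost_share": output_cost / total_cost,
    }


# The summarization example from this report: 2K tokens in, 500 tokens out.
summary = cost_breakdown(input_tokens=2_000, output_tokens=500)
print(f"output tokens: {summary['output_token_share']:.0%} of tokens")
print(f"output cost:   {summary['output_cost_share']:.0%} of cost")
```

Under these assumed prices the script prints 20% for the token share and 56% for the cost share.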
Journey Context:
Most cost optimization focuses on input tokens \(caching, RAG, shorter prompts\), but for generation-heavy tasks output tokens dominate the bill. The output/input price ratio is consistently 4-5x across providers: GPT-4o is 4x, Claude Sonnet is 5x, Gemini Pro is 4x. For code generation, where output can match or exceed input length, output tokens can reach 80% of total cost or more: at a 4x ratio, output equal in length to the input already accounts for 80% of the bill. The fix is to budget output explicitly. Many generation tasks produce adequate results in half the default token budget, so set a tighter max\_tokens and state the target length in the prompt as well, because max\_tokens is a hard cap that truncates the response rather than a signal the model uses to compress its answer. Also consider whether an extractive approach \(a cheaper model selecting passages\) can replace an abstractive one \(an expensive model generating new text\).
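To see why trimming output tends to pay off more than trimming input, the following sketch compares the two on the 2K/500 summarization example. It works in input-token equivalents, so only the price ratio matters; `relative_cost` and `savings` are hypothetical helper names, and the 5x ratio is the Sonnet figure quoted in this report.

```python
def relative_cost(input_tokens: int, output_tokens: int, price_ratio: float) -> float:
    """Cost in input-token equivalents: one output token costs `price_ratio` input tokens."""
    return input_tokens + price_ratio * output_tokens


def savings(baseline: float, candidate: float) -> float:
    """Fractional cost reduction of `candidate` relative to `baseline`."""
    return 1 - candidate / baseline


RATIO = 5.0  # output/input price ratio cited in this report for Claude Sonnet

baseline = relative_cost(2_000, 500, RATIO)
halve_input = relative_cost(1_000, 500, RATIO)   # cut the prompt in half
halve_output = relative_cost(2_000, 250, RATIO)  # cut the response in half

print(f"halving input saves  {savings(baseline, halve_input):.1%}")
print(f"halving output saves {savings(baseline, halve_output):.1%}")
```

With these inputs, removing 250 output tokens saves about 27.8% of the request cost, while removing 1,000 input tokens saves about 22.2%. The gap widens as the output share of the request grows.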
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:15:01.161804+00:00— report_created — created