Report #37006
[cost\_intel] Output token cost trap in generation-heavy pipelines
Audit your output-to-input token ratio. Output tokens typically cost 3-5x as much as input tokens on most providers, so for generation-heavy tasks \(summarization, code generation, report writing, translation\) they dominate total cost. Optimize by: \(1\) setting explicit length constraints in prompts, \(2\) using cheaper models that still meet quality thresholds, \(3\) splitting tasks into a cheap classification step plus targeted generation only where it is needed.
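The audit step above can be sketched as a small aggregation over logged token counts. This is a minimal illustration, not a provider integration: the `UsageRecord` shape, the `task` labels, and the `audit` function are hypothetical names introduced here, and the per-1M-token prices are passed in by the caller.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    task: str            # hypothetical label for a pipeline stage
    input_tokens: int
    output_tokens: int


def audit(records, input_price_per_m, output_price_per_m):
    """Aggregate tokens per task and report ratio, output cost share, and cost."""
    totals = {}
    for r in records:
        t = totals.setdefault(r.task, [0, 0])
        t[0] += r.input_tokens
        t[1] += r.output_tokens

    report = {}
    for task, (tin, tout) in totals.items():
        in_cost = tin * input_price_per_m / 1_000_000
        out_cost = tout * output_price_per_m / 1_000_000
        total = in_cost + out_cost
        report[task] = {
            "output_to_input_ratio": tout / tin if tin else float("inf"),
            "output_cost_share": out_cost / total if total else 0.0,
            "total_cost_usd": total,
        }
    return report


# Illustrative records only; real counts would come from your own request logs.
records = [
    UsageRecord("summarize", 2_000, 1_000),
    UsageRecord("summarize", 2_000, 1_000),
    UsageRecord("codegen", 1_000, 2_000),
]
report = audit(records, input_price_per_m=2.50, output_price_per_m=10.00)
for task, stats in report.items():
    print(task, stats)
```

Stages with a high `output_cost_share` are the ones where length constraints or a cheaper model are most likely to pay off.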
Journey Context:
Most cost optimization focuses on input tokens \(prompt engineering, context window management\), but output tokens are the silent cost multiplier. GPT-4o: $2.50/1M input vs $10/1M output \(4x\). Claude Sonnet: $3/1M input vs $15/1M output \(5x\). At these prices, for a task with 2K input and 1K output tokens, roughly 67-71% of the cost is output tokens. For code generation with 1K input and 2K output, roughly 89-91% of the cost is output tokens. The fix is not just shorter outputs: for generation-heavy tasks, model selection matters more than prompt optimization. A model that is 3x cheaper per token and produces adequate output cuts total cost by about two-thirds, which is more than prompt optimization on a more expensive model is likely to save. Also, models tend to be verbose by default. A simple 'be concise, max 200 words' instruction can cut output tokens by 40-60% with minimal quality impact for many task types.
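The arithmetic behind these shares can be checked with a short sketch. The GPT-4o and Claude Sonnet prices are the ones quoted in this report; `CHEAPER_MODEL` is a hypothetical model priced at one third of Claude Sonnet, and the 50% output reduction is an assumed midpoint of the 40-60% range, not a measured result.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD of one request, given per-1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def output_share(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Fraction of the request cost attributable to output tokens."""
    out_cost = output_tokens * output_price_per_m
    return out_cost / (input_tokens * input_price_per_m + out_cost)


# Prices quoted in the report, USD per 1M tokens: (input, output)
GPT_4O = (2.50, 10.00)
CLAUDE_SONNET = (3.00, 15.00)
# Hypothetical model priced at one third of Claude Sonnet (assumption)
CHEAPER_MODEL = (1.00, 5.00)

# Output share for the two workloads described in the report
for name, prices in (("GPT-4o", GPT_4O), ("Claude Sonnet", CLAUDE_SONNET)):
    print(name, "2K in / 1K out:", round(output_share(2_000, 1_000, *prices), 3))
    print(name, "1K in / 2K out:", round(output_share(1_000, 2_000, *prices), 3))

# Compare two fixes on the code-generation workload (1K in, 2K out)
baseline = request_cost(1_000, 2_000, *CLAUDE_SONNET)
concise = request_cost(1_000, 1_000, *CLAUDE_SONNET)   # output cut by an assumed 50%
cheaper = request_cost(1_000, 2_000, *CHEAPER_MODEL)   # same tokens, cheaper model

concise_saving = 1 - concise / baseline
cheaper_saving = 1 - cheaper / baseline
print("concise prompt saving:", round(concise_saving, 3))
print("cheaper model saving:", round(cheaper_saving, 3))
```

Under these assumptions, halving the output saves about 45% of the request cost, while the hypothetical cheaper model saves about 67%. The comparison only holds if the cheaper model's output actually meets the quality threshold.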
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:35:31.581880+00:00— report_created — created