Report #38233
[cost\_intel] Optimizing only input token costs while ignoring output token costs which are 3-5x more expensive per token
For generation-heavy tasks \(summarization, report writing, code generation, translation\), output tokens dominate costs. Optimize by constraining max\_tokens, using structured output formats like JSON mode that are more token-efficient than prose, and explicitly requesting concise outputs. A 4K-token output at $15/M output tokens costs $0.06—matching the input cost of a 20K-token prompt at $3/M input.
Journey Context:
Most cost optimization advice focuses on input tokens \(prompt engineering, caching, few-shot reduction\). But output tokens are 3-5x more expensive on most providers: GPT-4o is $2.50/M input vs $10/M output \(4x\); Sonnet is $3/M input vs $15/M output \(5x\); Opus is $15/M input vs $75/M output \(5x\). For tasks generating long outputs, output cost dominates. A summarization pipeline taking 10K input tokens and generating 2K output tokens on Sonnet costs $0.03 input \+ $0.03 output—roughly 50/50. But a code generation task taking 2K input and generating 4K output costs $0.006 input \+ $0.06 output—output is 10x the input cost. The fix is not just generate less but generate more efficiently. JSON mode produces more compact outputs than prose. Asking for bullet points instead of detailed paragraphs can cut output tokens 50-70%. Setting max\_tokens prevents runaway generation. Structured outputs \(function calling, JSON schema\) both reduce tokens and improve parseability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:39:08.443288+00:00— report_created — created