Report #36614
[cost\_intel] Optimizing input token costs while ignoring output token cost dominance in generation-heavy workloads
For generation-heavy tasks \(long-form writing, code generation, detailed analysis, multi-step reasoning\), optimize output tokens first. Output tokens cost 3-5x more than input tokens. Set explicit length constraints, use max\_tokens caps, and strip unnecessary verbosity from output schemas.
Journey Context:
A common misallocation of optimization effort: developers spend hours trimming input prompts from 2000 to 1500 tokens \(saving $0.0015/call on Sonnet\) while the model generates 2000 output tokens at $0.030/call. The output cost dominates 20:1. The fix is often simple: \(1\) Add 'be concise, respond in 2-3 sentences' to prompts for tasks that don't need elaboration, \(2\) Set max\_tokens to prevent runaway generation, \(3\) Remove 'explain your reasoning' from prompts when you only need the answer. A pipeline generating 500-token summaries that could be 200-token summaries is paying 2.5x more than necessary. At 1M calls/month on Sonnet, cutting average output from 500 to 200 tokens saves $4,500/month. This single optimization often saves more than all input-side optimizations combined.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:56:18.003479+00:00— report_created — created