Report #68159
[cost\_intel] Optimizing only input token costs while output tokens silently dominate total spend on generation-heavy tasks
For generation-heavy tasks \(code generation, report writing, content creation\), audit output token spend first. Sonnet output tokens cost 5x as much as input tokens \($15/M vs $3/M\). Constrain output with max\_tokens, prompt for concise responses, and evaluate whether Haiku's terser output, priced at ~$4/M output tokens, is adequate. A 1K-input/2K-output Sonnet call spends about 10x as much on output \(~$0.030\) as on input \(~$0.003\).
Journey Context:
Most cost optimization advice focuses on input tokens: prompt caching, batching, smaller prompts. But for generation-heavy tasks, output tokens dominate. Take a typical code generation call with 1K input tokens and 2K output tokens. With Sonnet, input ≈ $0.003 and output ≈ $0.030, so output is 10x the input cost and about 91% of the call's total. Two factors compound here: output tokens on frontier models are typically priced 3-5x higher than input tokens, and generation-heavy calls emit more tokens than they consume. Two levers: \(1\) constrain max\_tokens aggressively, since many tasks do not need 4K-token responses and frontier models tend toward verbosity when unconstrained; because max\_tokens is a hard cap that truncates a response rather than shortening it, pair it with prompt instructions to be concise; \(2\) evaluate whether a smaller model's terser output is acceptable. Haiku's output pricing is roughly 3-4x lower than Sonnet's. Degradation signature for small models: output is terser, less explanatory, and may skip edge-case handling. That is fine for internal tooling but problematic for customer-facing or safety-critical content.
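The arithmetic above can be checked with a small calculator. This is a minimal sketch, not a billing tool: the prices are the per-million figures quoted in this report \(treat them as snapshots and verify against current pricing\), and the names `CallCost`, `token_cost`, and `call_cost` are hypothetical helpers introduced only for illustration. Haiku's input price is not given in this report, so the Haiku comparison covers output cost only.

```python
from dataclasses import dataclass

# USD per million tokens, copied from the figures quoted in this report.
# Assumption: these are snapshots; check current published pricing.
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00
HAIKU_OUTPUT_PER_M = 4.00  # approximate ("~$4/M") in the report


@dataclass(frozen=True)
class CallCost:
    """Cost of one call, split by token direction (hypothetical helper type)."""
    input_usd: float
    output_usd: float

    @property
    def total_usd(self) -> float:
        return self.input_usd + self.output_usd

    @property
    def output_share(self) -> float:
        """Fraction of the call's cost spent on output tokens."""
        return self.output_usd / self.total_usd if self.total_usd else 0.0


def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of `tokens` tokens at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


def call_cost(input_tokens: int, output_tokens: int,
              input_per_m: float, output_per_m: float) -> CallCost:
    """Split a single call's cost into its input and output parts."""
    return CallCost(
        input_usd=token_cost(input_tokens, input_per_m),
        output_usd=token_cost(output_tokens, output_per_m),
    )


if __name__ == "__main__":
    # The report's example: 1K input tokens, 2K output tokens on Sonnet.
    baseline = call_cost(1_000, 2_000, SONNET_INPUT_PER_M, SONNET_OUTPUT_PER_M)
    # Lever 1: the same call if the response fits in 1K output tokens.
    capped = call_cost(1_000, 1_000, SONNET_INPUT_PER_M, SONNET_OUTPUT_PER_M)
    # Lever 2: output cost alone for 2K tokens at the report's Haiku price.
    haiku_output = token_cost(2_000, HAIKU_OUTPUT_PER_M)

    print(f"baseline: total ${baseline.total_usd:.4f}, "
          f"output share {baseline.output_share:.0%}")
    print(f"capped:   total ${capped.total_usd:.4f}")
    print(f"haiku output only: ${haiku_output:.4f} "
          f"vs sonnet ${baseline.output_usd:.4f}")
```

Running the same calculation over logged token counts from real traffic, rather than a single assumed call shape, shows whether a given workload is actually output-dominated before any lever is pulled.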
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:53:06.783679+00:00— report_created — created