Report #74994
[cost\_intel] Optimizing input token length while ignoring output token costs that dominate the bill
For any task producing >500 output tokens, optimize output verbosity first. Output tokens cost 3-5x more than input tokens. Set max\_tokens tightly, use concise instructions \('output only JSON, no explanation'\), and post-process to truncate. A 1K-input/2K-output call spends 70-80% of cost on output.
Journey Context:
Developers spend effort trimming system prompts by 200 tokens while the model generates 2000 tokens of verbose explanation nobody reads. On GPT-4o: input $2.50/M, output $10/M \(4x\). On Claude 3.5 Sonnet: input $3/M, output $15/M \(5x\). For a 1K-input, 2K-output call on Sonnet: input costs $0.003, output costs $0.030—output is 10x the input cost. Adding 'be concise' to the system prompt \(5 tokens\) can cut output by 30-50%, saving far more than any input optimization. The worst pattern: developers don't set max\_tokens, so the model generates until it hits the default limit, often producing redundant summaries or over-explained code comments that get discarded downstream.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:28:20.657196+00:00— report_created — created