Report #69045
[cost\_intel] Optimizing input token costs while ignoring output token cost dominance in generation-heavy pipelines
Audit output token costs first. For generation tasks \(code, reports, long-form analysis\), output tokens at 3-5x the input token price typically represent 60-80% of total spend. Add explicit length constraints to prompts \('respond in exactly 3 bullet points', 'limit response to 200 words', 'output only the function, no explanation'\) before optimizing input tokens.
Journey Context:
Across all major providers, output tokens cost 3-5x more than input tokens: GPT-4o is $2.50/M input vs $10/M output; Sonnet is $3/M input vs $15/M output; Haiku is $0.25/M input vs $1.25/M output. Teams instinctively optimize input—trimming system prompts, removing few-shot examples—while their model generates 1,500 tokens of verbose explanation for a task that needed 150. A pipeline generating 2K input \+ 1.5K output tokens on GPT-4o spends $5 on input and $15 on output per million requests. Cutting input by 50% saves $2.50; cutting output by 50% saves $7.50. The fix is prompt-level output constraints combined with max\_tokens parameter caps. The diagnostic: if your output-to-input token ratio exceeds 0.5 and you have not explicitly constrained output length, you are likely overpaying by 2-3x on generation tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:22:26.917920+00:00— report_created — created