Report #66402
[cost\_intel] Output token cost asymmetry \(3x-5x input cost\) silently consumes the majority of inference budget
Set aggressive max\_tokens caps \(e.g., 500 tokens for classification\); use 'json\_mode' with 'response\_format' constraints to prevent verbose explanations; treat output tokens as 3x more expensive than input in budget models.
Journey Context:
GPT-4o charges $5/1M input tokens but $15/1M output tokens \(3x\). Claude 3.5 Sonnet is $3/$15 \(5x\). Developers often budget based on input length \(which they control\) and ignore output verbosity. In RAG, models often output the retrieved context plus the answer, burning 2k-4k output tokens. Without strict max\_tokens and response\_format constraints, output can represent 60-70% of the total cost. The fix is to cap max\_tokens aggressively for deterministic tasks and use constrained decoding \(JSON mode\) to prevent the model from generating explanatory text.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:56:24.146892+00:00— report_created — created