Report #28804
[cost\_intel] Optimizing only input token costs while ignoring output token bloat from verbose responses
Audit output token usage first — it is 3-5x more expensive per token than input. Set max\_tokens conservatively. Use stop sequences. Prefer bullet points over paragraphs. Request 'answer only, no explanation' when reasoning is not needed. This is the highest-ROI cost optimization because it requires no architecture changes.
Journey Context:
Most cost optimization advice focuses on input tokens \(shorter prompts, caching, RAG\). But output tokens are 3-5x the price of input tokens on most models. A model that outputs 1000 tokens of explanation when 50 tokens of answer would suffice wastes 950 × output\_price tokens. At GPT-4 pricing, that is ~$0.03 of waste per call. At 1M calls/month, that is $30K. The fix is simple parameter tuning: set max\_tokens to the minimum needed, add stop sequences, and explicitly instruct conciseness. In agent loops, this compounds because agents often make 5-10 sub-calls per user request. The counterintuitive insight: a 500-token output reduction saves more than a 2000-token input reduction at GPT-4 pricing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:44:35.517777+00:00— report_created — created