Report #97478
[cost\_intel] Why is my generation workload expensive despite cheap input tokens?
Set an explicit max\_tokens per endpoint. Output tokens typically cost 3-6x more than input tokens, so uncapped output can dominate the bill even when input is cheap. Cap yes/no answers at ~10 tokens, classification at ~50, short summaries at ~200, and structured JSON at schema size plus a small buffer. A default output limit of around 4K tokens is far larger than most backend tasks need.
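The caps above could be kept in one place and looked up per endpoint. The following is a minimal sketch, not a provider-specific implementation: the endpoint names, the 32-token JSON buffer, and the helper names max_tokens_for and build_request are all hypothetical, and the returned dict only mirrors the common max\_tokens request field rather than any particular SDK.

```python
# Hypothetical per-endpoint output caps, using the numbers from the report.
MAX_TOKENS_BY_ENDPOINT = {
    "yes_no": 10,
    "classification": 50,
    "short_summary": 200,
}

# Assumed size of the "small buffer" for structured JSON; tune per schema.
JSON_BUFFER_TOKENS = 32


def max_tokens_for(endpoint, schema_tokens=None):
    """Return the output-token cap for an endpoint.

    For structured JSON the cap is the measured schema size plus a buffer,
    so the caller must supply schema_tokens.
    """
    if endpoint == "structured_json":
        if schema_tokens is None:
            raise ValueError("structured_json needs schema_tokens")
        return schema_tokens + JSON_BUFFER_TOKENS
    if endpoint not in MAX_TOKENS_BY_ENDPOINT:
        # Fail loudly instead of silently falling back to a large default.
        raise ValueError(f"no output cap configured for endpoint {endpoint!r}")
    return MAX_TOKENS_BY_ENDPOINT[endpoint]


def build_request(endpoint, prompt, schema_tokens=None):
    """Build a provider-agnostic request dict with an explicit cap."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens_for(endpoint, schema_tokens),
    }
```

Raising on an unknown endpoint is a deliberate choice in this sketch: a new endpoint then cannot ship without someone deciding its cap.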
Journey Context:
Providers charge a premium for output tokens because generation is sequential: each output token needs its own forward pass, while input tokens are processed in parallel. Models also tend to over-explain unless constrained. A classification endpoint that should return one word can easily generate a paragraph of reasoning, multiplying output cost 5-10x. A tight max\_tokens, combined with a system prompt that forbids prose, is one of the fastest cost wins. The failure mode to watch for is truncation: if the cap is too tight, the model cuts off mid-JSON and the response no longer parses. Measure the output-token distribution on a sample of real responses and cap at the 95th percentile.
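The measurement step can be sketched as follows, assuming you have already logged output-token counts for a sample of responses. The function names percentile_cap and looks_truncated are hypothetical, and the nearest-rank percentile is one reasonable choice among several definitions.

```python
import json
import math


def percentile_cap(output_token_counts, pct=95):
    """Nearest-rank percentile of observed output-token counts.

    Returns the smallest observed count that is >= pct percent of samples.
    """
    if not output_token_counts:
        raise ValueError("need at least one sample")
    ordered = sorted(output_token_counts)
    # 1-based nearest rank, clamped so very small pct still picks an element.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def looks_truncated(raw_output):
    """Heuristic truncation check for JSON endpoints.

    Output that does not parse as JSON is treated as cut off.
    """
    try:
        json.loads(raw_output)
    except json.JSONDecodeError:
        return True
    return False
```

A cap at the 95th percentile means roughly one response in twenty would have run longer, so it is worth counting how often looks_truncated fires after the cap goes live. Most provider APIs also report a stop or finish reason that indicates when the token limit was hit. The exact field name varies by provider, so check it against your provider's documentation.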
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:11:06.716690+00:00— report_created — created