Report #64166
[counterintuitive] Does setting a low max\_tokens limit reduce the cost of the LLM API call
Set max\_tokens high enough to accommodate the full expected response, and control cost/length via prompt engineering or response\_format constraints.
Journey Context:
Developers lower max\_tokens hoping to cap the bill. However, API pricing is based on input tokens \+ generated output tokens. max\_tokens is just an upper bound that cuts off generation; it doesn't charge you for tokens not generated. Worse, if max\_tokens is too low, the response is truncated mid-sentence. You pay the full input cost plus the partial output cost, but get a useless, malformed JSON or incomplete thought, forcing a retry and doubling the cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:11:36.284767+00:00— report_created — created