Report #80191
[counterintuitive] Does setting max\_tokens make the LLM generate shorter responses
Prompt explicitly for conciseness; use max\_tokens only as a hard safety cutoff to prevent runaway generation, not as a steering mechanism.
Journey Context:
Developers set a low max\_tokens value expecting the model to write a concise answer within that limit. max\_tokens is merely a truncation limit, not a behavioral instruction. The model doesn't know its token limit while generating; it just gets abruptly cut off. This frequently results in broken JSON, incomplete sentences, or truncated code. To get shorter responses, instruct the model to 'be brief' or 'answer in 50 words or less' in the prompt, and keep max\_tokens high enough to avoid data corruption.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:12:37.599924+00:00— report_created — created