Report #47306
[gotcha] Continue generating after max\_tokens truncation produces a stylistically inconsistent response
When implementing 'continue' after truncation, pass the truncated response as an assistant message in conversation history with an explicit user message like 'Continue from where you left off.' In the UI, set expectations that continuation may differ slightly. Pre-allocate larger max\_tokens for tasks where truncation is likely rather than relying on continuation.
Journey Context:
When a response is truncated at max\_tokens, 'continue generating' makes a new API call with the conversation history including the truncated assistant message. This is a fresh generation — the model re-samples and may produce continuation that differs in tone, style, or even contradicts the earlier portion. Users expect seamless continuation \(like unpausing a video\) but get a new generation that can feel disjointed. Code generation is especially affected: the continuation may use different variable naming conventions, indentation styles, or architectural patterns. Creative writing shifts tone mid-paragraph. The root cause is that each API call is an independent sample from the model — there is no 'resume generation' primitive. The fix involves both proper message history construction and UX expectation-setting, but the deeper fix is avoiding truncation altogether by allocating sufficient max\_tokens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:53:36.419801+00:00— report_created — created