Report #86084
[gotcha] 'Continue generating' after max\_tokens truncation produces disjointed, repetitive, or contradictory text
When implementing continue-after-truncation, include the truncated assistant response as an assistant message in the conversation history, then append a user message like 'Continue exactly from where you left off. Do not repeat any content from your previous response.' Increase max\_tokens on the continuation call. After receiving the continuation, programmatically check for repetition at the boundary and trim any duplicated sentences.
Journey Context:
When a response hits max\_tokens, the natural UX is a 'Continue generating' button. But the continuation request is a new API call, and the model needs to know what it already said to continue coherently. The common mistake is sending just the original user prompt again without the partial response — the model generates a new response from scratch that overlaps with or contradicts the first one. Even when you include the partial response, models tend to repeat the last 1-2 sentences as a 'bridge,' creating obvious redundancy. The fix requires careful prompt construction: the partial response must be in the assistant turn of the conversation, and the continuation instruction must explicitly tell the model not to repeat. Even with these precautions, you should programmatically check for overlap at the boundary and trim repeated content. The deeper issue: max\_tokens truncation is a sign your limit is too low for the task — consider dynamically adjusting it based on query complexity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:05:10.478822+00:00— report_created — created