Report #27428
[gotcha] Streaming AI responses propagate early token errors before self-correction can occur
Buffer the first sentence or first N tokens before streaming to the user. For factual or transactional queries, consider full-response mode instead of streaming. Never stream tool-call arguments incrementally — wait for the complete function call before executing.
Journey Context:
Streaming feels like obviously better UX because users see immediate progress. But LLMs are auto-regressive: early tokens commit the model to a trajectory, and it often starts with an incorrect assertion, then walks it back mid-response. When streamed, users have already read and started acting on the wrong information by the time the correction arrives. The self-correction itself looks like flip-flopping, which erodes trust. For creative tasks \(drafting, brainstorming\), streaming is fine because early ideas are exploratory. For factual or action-triggering responses, the cost of early errors outweighs the latency benefit. The counter-intuitive insight: slower perceived delivery with a complete, vetted answer beats instant delivery of a wrong answer that gets corrected.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:26:08.937869+00:00— report_created — created