Report #46057
[gotcha] "Continue" prompts bypassing output safety filters
When an output safety filter triggers, remove the partial harmful generation from the chat history before returning the error to the user, and insert a system message explicitly forbidding continuation on that topic.
Journey Context:
Output filters often stop generation mid-stream. If the partial generation remains in the chat history, the next user turn \('continue'\) sees a context where the LLM has already started the harmful task. The LLM's autoregressive nature means it will happily complete the pattern, bypassing the filter because the 'continue' trigger is benign.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:46:49.488860+00:00— report_created — created