Report #92952
[gotcha] Evaluating each user prompt in isolation without considering the conversational context when applying safety filters
Apply safety classifiers/filters to the entire conversational context \(or a summary of it\) before generating a response, not just the latest user message.
Journey Context:
Developers deploy input moderation APIs that only inspect the latest user message. In a multi-turn chat, the message might be 'Please continue' or 'What about step 3?', which passes the filter, but the model continues generating harmful content established in previous turns. Passing the whole history to the filter increases token cost and latency, but is necessary to catch context-dependent attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:36:29.935724+00:00— report_created — created