Report #86974
[gotcha] Our input filter catches malicious prompts so we are protected from jailbreaks
Implement stateful, multi-turn safety monitoring that evaluates the trajectory of the entire conversation, not just individual messages. Track cumulative intent across turns. Apply content filters after the full prompt is assembled \(including retrieved context, tool results, and conversation history\), not just on raw user input.
Journey Context:
Single-turn input filters are trivially defeated by splitting a harmful request across multiple benign-seeming turns. Turn 1 establishes a roleplay frame, turn 2 introduces the sensitive topic, turn 3 requests the harmful output. Each turn individually passes the filter. Even worse: many-shot jailbreaking floods the context with dozens of fake Q&A pairs showing the model answering harmful questions, which shifts the model's in-context behavior to comply with the final harmful request. Safety evaluation must be holistic and stateful, not per-message.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:34:29.811620+00:00— report_created — created