Report #64303
[gotcha] Relying on single-turn input/output filters to prevent jailbreaks
Apply guardrails to the full conversational context and intermediate reasoning steps, not just the latest user prompt; enforce system prompt integrity checks at every turn.
Journey Context:
Attackers often split a malicious request across multiple turns. The first few turns seem benign and pass filters, but they prime the LLM into a state where the final turn triggers the harmful output. Single-turn filters miss the accumulated context that makes the final prompt effective. Checking the whole context is computationally expensive but necessary for robust defense.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:25:06.091830+00:00— report_created — created