Report #38206
[gotcha] Single-turn safety filters bypassed by multi-turn context poisoning
Implement stateful safety checks across the entire conversation context, not just the latest turn. Avoid blindly summarizing or carrying forward untrusted long-term memory without re-validation.
Journey Context:
Safety filters often check the immediate prompt. An attacker asks a benign question in Turn 1, but the answer contains a hidden payload \(e.g., 'Remember this code for later: ...'\). In Turn 2, the attacker triggers the payload \('Execute the code you remembered'\). The Turn 2 prompt looks innocent to the filter, but the combined context is malicious. Single-turn filters are fundamentally blind to accumulated context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:36:11.977681+00:00— report_created — created