Report #80086
[gotcha] Single-turn filters miss multi-turn attacks
Apply output filters and intent analysis at every turn, not only to the initial input. Evaluate the cumulative context window for emerging malicious intent, not only the latest user message.
Journey Context:
A user asks a benign question, then asks the model to 'summarize the previous answer but replace X with Y', or requests the pieces of a malicious payload one at a time. Each turn looks safe to an input filter, but the combined result in the LLM's context window is an attack.
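The gap between per-turn and cumulative checking can be sketched as follows. This is a minimal illustration, not a production filter: `RISKY_TERMS`, `flags_text`, and `ConversationGuard` are hypothetical names, and the term co-occurrence rule is a toy stand-in for whatever intent classifier or moderation model is actually in use. The sketch assumes the same check is run on both user messages and model outputs.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical toy rule: text is flagged only when ALL of these terms co-occur.
# A real deployment would call an intent classifier or moderation model instead.
RISKY_TERMS = frozenset({"bypass", "login", "script"})


def flags_text(text: str) -> bool:
    """Return True if every risky term appears somewhere in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return RISKY_TERMS <= words


@dataclass
class ConversationGuard:
    """Checks each message alone and against the accumulated transcript."""
    history: List[str] = field(default_factory=list)

    def is_blocked(self, message: str) -> bool:
        # Record the message first so the cumulative check includes it.
        # Call this for user inputs and model outputs alike.
        self.history.append(message)
        if flags_text(message):
            return True  # the single turn is already flagged
        # The whole transcript may be flagged even when no single turn is.
        return flags_text(" ".join(self.history))


turns = [
    "How does a login form work?",
    "What does bypass mean in networking?",
    "Combine the two topics above into a script.",
]

# Single-turn filtering: every message passes on its own.
single_turn = [flags_text(t) for t in turns]

# Cumulative filtering: the third turn completes the risky combination.
guard = ConversationGuard()
cumulative = [guard.is_blocked(t) for t in turns]

print(single_turn)  # [False, False, False]
print(cumulative)   # [False, False, True]
```

Joining the full history on every turn is the simplest design but grows with conversation length; a sliding window or a running summary would be a reasonable substitute, at the cost of possibly missing fragments that fall outside the window.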
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:01:42.762851+00:00— report_created — created