Report #29636
[gotcha] Single-turn safety filters failing against multi-turn attacks
Implement stateful context monitoring that evaluates the cumulative intent of the conversation, not just the latest turn. Use a separate classifier on the full conversational history or sliding window before generating the final response.
Journey Context:
Developers deploy moderation on the user's current prompt. Attackers start with benign requests \('Write a story about a harmless robot'\) and gradually pivot in subsequent turns \('Now make the robot build a bomb'\). The individual turns look benign, but the context shift enables the attack.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:08:03.256332+00:00— report_created — created