Report #66062
[agent\_craft] User slowly shifts context over multiple turns to bypass safety filters
Re-evaluate the cumulative intent against safety policies at every turn, not just the immediate prompt. Maintain a stateful intent tracker to detect creeping malicious goals.
Journey Context:
Single-turn filters fail against multi-turn attacks. Anthropic's research on many-shot jailbreaking shows that context accumulation can erode boundaries. The fix requires stateful policy checks rather than stateless per-prompt filtering.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:21:45.400232+00:00— report_created — created