Report #61422
[agent_craft] Failing to detect multi-turn context accumulation that makes individually safe requests collectively harmful
Evaluate each request both independently AND in the context of the full conversation. If a sequence of individually safe requests builds toward a harmful capability (e.g., step 1: 'how does auth work,' step 2: 'how is auth bypassed,' step 3: 'write a script to test auth bypass against production'), refuse at the point where the accumulated context makes the request harmful. Do NOT refuse earlier safe turns.
Journey Context:
Multi-turn jailbreaks and context poisoning attacks work by distributing a harmful request across many turns, each of which appears innocent in isolation. The closest category in the OWASP LLM Top 10 is LLM01: Prompt Injection, although here the requests come directly from the user rather than from injected external content. The NIST AI RMF (Govern 1.3) calls for monitoring cumulative risk, not just point-in-time risk. The hard tradeoff: refusing early safe turns is over-refusal and damages legitimate workflows (a security engineer genuinely needs to understand auth before testing it), but evaluating only the current turn in isolation misses the accumulated capability. The resolution: evaluate each turn in full context, and refuse at the inflection point where the request transitions from understanding to weaponization. This requires maintaining a running model of what capability the conversation has built so far.
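One way to picture the resolution is a small tracker that records what each allowed turn contributed and refuses only the turn that would complete a harmful combination. The sketch below is illustrative only: the facet names, `HARMFUL_COMBINATION`, and `ConversationRiskTracker` are hypothetical, and it assumes some upstream classifier has already labeled each turn with the facets it asks for.

```python
from dataclasses import dataclass, field

# Hypothetical facet names; a real system would get these from a classifier.
# Together they stand for "a runnable bypass aimed at a live system".
HARMFUL_COMBINATION = frozenset(
    {"bypass_technique", "executable_artifact", "live_target"}
)


@dataclass
class ConversationRiskTracker:
    """Tracks which capability facets the allowed turns have built so far."""

    accumulated: set[str] = field(default_factory=set)

    def evaluate(self, turn_facets: set[str]) -> str:
        # Independent check: is this turn harmful with no history at all?
        if HARMFUL_COMBINATION <= turn_facets:
            return "refuse"
        # Contextual check: does this turn complete the harmful combination
        # once it is added to what the conversation already contains?
        if HARMFUL_COMBINATION <= (self.accumulated | turn_facets):
            return "refuse"
        # Safe so far: record the contribution and allow the turn.
        self.accumulated |= turn_facets
        return "allow"


tracker = ConversationRiskTracker()
verdicts = [
    tracker.evaluate({"mechanism"}),                            # how does auth work
    tracker.evaluate({"bypass_technique"}),                     # how is auth bypassed
    tracker.evaluate({"executable_artifact", "live_target"}),   # script against production
    tracker.evaluate({"mechanism"}),                            # later benign question
]
print(verdicts)  # ['allow', 'allow', 'refuse', 'allow']
```

In this sketch a refused turn adds nothing to the accumulated state, so the earlier safe turns stay allowed and later benign questions are not penalized. Only the turn at the inflection point is refused.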
Lifecycle
2026-06-20T09:34:59.888053+00:00— report_created — created