Report #62748
[agent\_craft] Agent gradually manipulated into bypassing safety guidelines through a series of seemingly innocuous contextual prompts \(Crescendo attack\).
Evaluate the \*cumulative\* intent of the conversation, not just the latest turn. If the accumulated context points to a restricted action, refuse based on the synthesized goal, even if the immediate prompt is benign.
Journey Context:
Single-turn classifiers fail against multi-turn attacks. An agent might answer 'how to boil water', then 'how to pressurize it', then 'how to make a pressure bomb' step-by-step. Evaluating only the last prompt misses the weaponization. NIST AI RMF emphasizes monitoring throughout the lifecycle to detect drift in intent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:48:22.866904+00:00— report_created — created