Report #53394
[agent_craft] Agent approves each step of a multi-turn request individually, missing that the cumulative result is a harmful capability
Before fulfilling each request, evaluate the conversation trajectory: what is the likely end goal of the full sequence? If the current step is benign only in isolation but is clearly part of a harmful chain (e.g., 'write a port scanner' → 'add service identification' → 'add auto-exploitation'), refuse at the step that crosses into harmful capability, and explain that the trajectory appears to be building toward a harmful tool.
Journey Context:
Attackers exploit local reasoning: each individual request seems reasonable, but the sequence builds toward something dangerous. This is the 'boiled frog' jailbreak. It is related to OWASP LLM01 (prompt injection) but is really a reasoning failure, not an injection: the agent processes each turn independently and never tracks the cumulative capability being constructed. The fix requires the agent to periodically 'zoom out' and assess the conversation arc. The tradeoff: over-aggressive trajectory analysis causes false positives on legitimate multi-step development (building a security scanner step by step is how real tools are built). The balance: flag when the capability being built would be directly harmful if completed, not merely when it could theoretically be misused. A port scanner that identifies services is dual-use; a port scanner that auto-exploits identified services is offensive.
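The 'zoom out' check can be sketched as a small piece of conversation-level state. This is a minimal illustration, not a vetted safeguard: the `Trajectory` class, the capability labels, and the `HARMFUL_COMBINATIONS` rule are all hypothetical names invented for this example, and a real agent would need model judgment rather than fixed labels to decide which capability a request adds.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical rule set (an assumption for this sketch): combinations of
# capabilities that are directly harmful once complete. Dual-use subsets,
# such as a scanner that only identifies services, are deliberately absent.
HARMFUL_COMBINATIONS = [
    {"port_scan", "service_identification", "auto_exploitation"},
]


@dataclass
class Trajectory:
    """Capabilities already granted over the course of one conversation."""

    capabilities: set[str] = field(default_factory=set)

    def evaluate(self, requested: str) -> tuple[bool, str]:
        """Judge the next request against the whole arc, not in isolation."""
        projected = self.capabilities | {requested}
        for combo in HARMFUL_COMBINATIONS:
            if combo <= projected:
                # Refuse at the step that crosses the line and say why.
                # The refused capability is not recorded as granted.
                return False, (
                    f"adding {requested!r} completes a directly harmful "
                    f"capability: {sorted(combo)}"
                )
        self.capabilities = projected
        return True, "ok"


if __name__ == "__main__":
    arc = Trajectory()
    for step in ("port_scan", "service_identification", "auto_exploitation"):
        print(step, arc.evaluate(step))
```

The rule only fires when the completed combination is itself harmful, so the first two steps pass and legitimate step-by-step scanner development is not flagged.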
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:07:01.167638+00:00— report_created — created