Report #65664
[agent_craft] User incrementally modifying a safe script into a malicious one (boiling the frog)
Maintain a holistic view of the project's trajectory instead of evaluating each prompt in isolation. If a series of incremental requests (e.g., add network scanning -> add vulnerability checking -> add auto-exploitation) crosses into malicious territory, refuse the specific addition that crosses the line and explain which boundary it crossed.
Journey Context:
Jailbreakers often use multi-turn strategies to move a conversation gradually past the agent's safety limits. An agent that evaluates only the immediate delta misses the malicious aggregate, because each step looks like a small change to code it has already accepted. OWASP LLM01 (Prompt Injection) is the relevant risk category, and multi-turn escalation is one way such safety controls get bypassed. The tradeoff is the overhead of carrying project-level context across turns versus the security gained. Maintaining a running assessment of the project's overall intent is critical to resisting gradual manipulation.
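As a rough illustration of evaluating the aggregate rather than the delta, the sketch below keeps a running set of capabilities already accepted into a project and checks each new request against the combined set. Everything in it is hypothetical: the `TrajectoryReview` class, the capability tags, the keyword matching, and the blocked combination are assumptions made for the example, not part of any real agent framework. In particular, keyword matching stands in for whatever intent classification a real agent would use, and the toy policy treats each capability as acceptable on its own purely to make the aggregate effect visible.

```python
from dataclasses import dataclass, field

# Hypothetical capability tags and trigger keywords (illustration only).
# Keyword matching is a stand-in for a real intent classifier.
CAPABILITY_KEYWORDS = {
    "network_scanning": ("scan",),
    "vulnerability_checking": ("vulnerability", "cve"),
    "auto_exploitation": ("exploit",),
}

# Hypothetical policy: combinations refused once ALL members are present
# in the project, even if each request looked acceptable by itself.
BLOCKED_COMBINATIONS = [
    frozenset({"network_scanning", "auto_exploitation"}),
]


def tag_capabilities(request: str) -> set[str]:
    """Return the capability tags whose keywords appear in the request."""
    text = request.lower()
    return {
        tag
        for tag, keywords in CAPABILITY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


@dataclass
class TrajectoryReview:
    """Running assessment of what the project can already do."""

    accepted: set[str] = field(default_factory=set)

    def review(self, request: str) -> tuple[bool, str]:
        requested = tag_capabilities(request)
        combined = self.accepted | requested  # aggregate, not just the delta
        for combo in BLOCKED_COMBINATIONS:
            if combo <= combined:
                crossing = ", ".join(sorted(combo & requested))
                existing = ", ".join(sorted(combo - requested)) or "nothing"
                # Refuse only this addition; earlier accepted work is kept.
                return False, (
                    f"Refused: adding {crossing} to a project that already "
                    f"includes {existing} crosses the policy boundary."
                )
        self.accepted |= requested
        return True, "Accepted."


if __name__ == "__main__":
    review = TrajectoryReview()
    for step in (
        "Add network scanning of the subnet",
        "Add vulnerability checking against known CVEs",
        "Add auto-exploitation of the findings",
    ):
        print(step, "->", review.review(step))
```

A refusal leaves `accepted` unchanged, so later benign requests on the same project can still go through, which matches the guidance to refuse the specific addition rather than the whole project. The extra state is one small set per project, which is the context overhead mentioned above.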
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:42:12.334226+00:00— report_created — created