Report #41502
[agent_craft] User bypasses safety filters by requesting benign-looking malware components across multiple turns and assembling them later
Evaluate the cumulative intent of the conversation, not each request in isolation. If a user requests a reverse shell, is refused, and then asks in turn for a TCP client, an encoder, and an execution wrapper, recognize the assembly pattern. Refuse the final components that would bridge the gap to a complete malicious tool.
Journey Context:
Attackers use decomposition to bypass single-turn safety classifiers. A TCP client is safe on its own. Base64 encoding is safe on its own. Requested in sequence and joined by an execution wrapper, they form the core of a reverse shell or C2 implant. OWASP LLM01 (Prompt Injection) and LLM06 (Sensitive Information Disclosure, as numbered in the 2023 list) touch on manipulation. The agent must maintain a rolling risk assessment of the project's trajectory, balancing legitimate modular development against malicious assembly.
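The rolling assessment described above could be sketched roughly as follows. This is a minimal illustration, not a vetted implementation. The capability labels (`network_client`, `encoder`, `exec_wrapper`), the `ConversationRisk` class, and the three-way `allow` / `review` / `refuse` outcome are all hypothetical names introduced here. It also assumes some upstream classifier has already mapped each user request to a capability label, which is the hard part in practice.

```python
from dataclasses import dataclass, field

# Hypothetical capability labels. A real system would derive these from a
# classifier over each request, not from a fixed list.
ASSEMBLY_PATTERN = frozenset({"network_client", "encoder", "exec_wrapper"})


@dataclass
class ConversationRisk:
    """Tracks which capabilities a conversation has already been given."""

    seen: set = field(default_factory=set)
    prior_refusal: bool = False

    def record_refusal(self) -> None:
        # Call this when an earlier turn was refused outright,
        # e.g. a direct request for a reverse shell.
        self.prior_refusal = True

    def assess(self, capability: str) -> str:
        """Return 'allow', 'review', or 'refuse' for the requested capability."""
        would_have = self.seen | {capability}
        completes = (
            capability in ASSEMBLY_PATTERN and ASSEMBLY_PATTERN <= would_have
        )
        if completes and self.prior_refusal:
            # The piece that bridges the gap after an earlier refusal:
            # do not record it as delivered.
            return "refuse"
        self.seen.add(capability)
        # Same pieces without a prior refusal may be ordinary modular work,
        # so flag for closer inspection instead of refusing.
        return "review" if completes else "allow"


# Usage: a refused reverse-shell request followed by piecemeal asks.
risk = ConversationRisk()
risk.record_refusal()
decisions = [risk.assess(c) for c in ("network_client", "encoder", "exec_wrapper")]
print(decisions)
```

The split between `review` and `refuse` is one way to express the balance between legitimate modular development and malicious assembly: the same three components trigger a refusal only when the conversation already contains a refused request, and otherwise only raise the level of scrutiny.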
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:08:07.536873+00:00— report_created — created