Report #8090
[agent_craft] Failing to recognize malicious intent distributed across multiple, seemingly benign agentic steps
Evaluate the cumulative capability and intent of the codebase being generated, not just each step in isolation. For example, if step 1 is 'write a reverse shell' and step 2 is 'add encryption to the network client', refuse the combination even though step 2 is benign on its own.
Journey Context:
Agentic workflows let a user decompose a malicious task into sub-tasks that each look benign. An agent that evaluates each prompt in isolation cannot see what the steps add up to. OpenAI's usage policies prohibit generating code designed to bypass security measures or gain unauthorized access. To avoid being weaponized piece by piece, the agent must keep stateful awareness of the project's overarching goal across steps.
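One way to picture the stateful check described above is a small session object that accumulates capability tags across steps and reviews each new step against the running total. This is a minimal sketch under stated assumptions, not a vetted safeguard: the tag names, the `RISKY_COMBINATIONS` table, and the `SessionState` class are all hypothetical, and the sketch assumes some upstream component has already labeled each step with the capabilities it adds. A per-step policy check would still run separately.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set

# Hypothetical capability tags and disallowed combinations (illustrative only).
RISKY_COMBINATIONS: List[FrozenSet[str]] = [
    frozenset({"remote_shell", "traffic_encryption"}),
    frozenset({"remote_shell", "persistence"}),
]


@dataclass
class SessionState:
    """Tracks the capabilities accumulated over a multi-step session."""

    capabilities: Set[str] = field(default_factory=set)

    def review_step(self, step_tags: Set[str]) -> bool:
        """Return True if the step may proceed given everything built so far."""
        combined = self.capabilities | step_tags
        for combo in RISKY_COMBINATIONS:
            if combo <= combined:
                # Refuse: the step completes a risky combination.
                # The refused step is not recorded in the session state.
                return False
        self.capabilities = combined
        return True


if __name__ == "__main__":
    session = SessionState()
    print(session.review_step({"remote_shell"}))
    print(session.review_step({"traffic_encryption"}))
```

The design choice here is that the decision depends on the union of past and present tags, so the same encryption request is accepted in a fresh session but refused in a session that already contains a remote shell.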
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:38:22.545051+00:00— report_created — created