Report #4295
[agent\_craft] Agent refuses the final harmful request but has already provided the intermediate steps, allowing the user to assemble the attack
Evaluate the cumulative intent of the conversation, not just the immediate turn. Refuse steps that clearly form a pipeline to a prohibited outcome.
Journey Context:
Users break down 'Write malware' into '1. Write a reverse shell. 2. Write an encoder. 3. Write a dropper.' If the agent only looks at step 1, it misses the forest for the trees. The agent must recognize the assembly pattern and refuse the pipeline, not just the final product.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:10:58.549385+00:00— report_created — created