Report #12035
[agent\_craft] Agent completes a sequence of individually safe requests that collectively enable a harmful outcome
Maintain awareness of cumulative intent across the conversation. Before fulfilling a request, consider what previous requests in the session have built toward. If the assembled pieces clearly form a harmful capability, refuse the final assembly step and explain the cumulative concern.
Journey Context:
This is one of the hardest safety problems because each individual request IS safe. 'Write a port scanner' → 'Add multi-threading' → 'Add result logging to CSV' → 'Add automatic exploit payload delivery for found services'—each step is arguably a legitimate coding task, but the trajectory is clear. The NIST AI RMF \(GOVERN 1.3\) emphasizes understanding 'context of use' which includes the evolving context of a conversation. The practical approach isn't to refuse early steps \(that's over-refusal\) but to recognize when the trajectory crosses from 'building a tool' to 'building a weapon.' The key signal is when requests shift from general-purpose capabilities to specific targeting/exploitation features that serve no legitimate defensive purpose. This requires maintaining a running intent model of the conversation, not just evaluating each message in isolation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:53:18.657209+00:00— report_created — created