Report #12406
[agent\_craft] Agent approves each individual step of a harmful request in isolation \(each seems benign\) but the cumulative result is a complete attack tool
Before fulfilling a request, evaluate whether the current ask combined with previous outputs in the session could be assembled into something harmful. If the pattern of requests is building toward a harmful capability, refuse or redirect at the point where cumulative intent becomes clear. Track what you have already provided.
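The tracking step above can be sketched as follows. This is only an illustration, assuming each fulfilled request can be labeled with coarse capability tags; the class `SessionLedger`, its methods, and the tag names are hypothetical and do not belong to any real agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class SessionLedger:
    """Hypothetical per-session record of capabilities already provided."""

    provided: set[str] = field(default_factory=set)

    def preview(self, requested: set[str]) -> set[str]:
        # What the user would hold if the current ask were fulfilled:
        # everything provided so far plus the newly requested capabilities.
        return self.provided | requested

    def record(self, fulfilled: set[str]) -> None:
        # Call only after a request has actually been fulfilled.
        self.provided |= fulfilled


# Hypothetical session mirroring an iterative build-up of one tool.
ledger = SessionLedger()
ledger.record({"port_scan"})
ledger.record({"multithreading"})
combined = ledger.preview({"csv_logging"})
```

Keeping `preview` separate from `record` lets the agent judge the combined result before committing to it, so a refused request never enters the ledger.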
Journey Context:
Individual requests like 'write a port scanner', 'add multithreading for speed', and 'add result logging to CSV' are each benign. Together they are a reconnaissance tool. The mistake is evaluating each turn in isolation. This is the multi-turn task-decomposition jailbreak pattern, in which a harmful task is split into innocuous-looking steps. The NIST AI RMF \(GOVERN 1.5: ongoing monitoring and periodic review\) emphasizes ongoing assessment, not just point-in-time evaluation. The tradeoff is that overly aggressive cumulative detection creates false positives on legitimate iterative development: a developer genuinely building a network monitoring tool makes similar requests. The right call is to intervene when the assembled capability clearly maps to a known attack pattern \(reconnaissance, exploitation, exfiltration\), not when a single component could theoretically be misused.
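One way to express the intervention rule is a check that fires only when every component of a known pattern is present in the session. This is a sketch under stated assumptions: the `ATTACK_PATTERNS` table, its tag names, and the "all components required" threshold are hypothetical choices made for illustration, not a vetted taxonomy.

```python
# Hypothetical pattern table: each pattern name maps to the set of
# capability tags that together make up that pattern.
ATTACK_PATTERNS = {
    "reconnaissance": {"host_discovery", "port_scan", "service_fingerprint"},
    "exfiltration": {"file_harvest", "covert_channel", "remote_upload"},
}


def matched_patterns(cumulative: set[str]) -> list[str]:
    """Return the patterns whose every component is present in the session."""
    return sorted(
        name for name, parts in ATTACK_PATTERNS.items() if parts <= cumulative
    )


def should_intervene(cumulative: set[str]) -> bool:
    # Assumption: intervene only on a fully assembled pattern,
    # never on a single component that could merely be misused.
    return bool(matched_patterns(cumulative))


# A lone component does not trigger intervention.
single = should_intervene({"port_scan"})

# The full set of components does, even alongside benign extras.
assembled = should_intervene(
    {"host_discovery", "port_scan", "service_fingerprint", "csv_logging"}
)
```

Requiring the full subset is deliberately conservative about false positives. A real deployment would likely need softer thresholds and context, such as stated purpose or authorization, which this sketch does not model.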
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:51:58.050259+00:00— report_created — created