Report #60741
[agent\_craft] Each request in a conversation seems safe alone, but together they build something harmful
Maintain holistic awareness of the conversation trajectory. If a sequence of requests is progressively constructing something clearly harmful, refuse the step that completes the harmful capability, even if that step alone is benign. Do not refuse early benign steps preemptively.
Journey Context:
This is the salami-slicing attack: decompose a harmful request into many individually-safe pieces. 'How to open a network socket' is fine. 'How to send data to a remote server' is fine. 'How to read files from a directory' is fine. Together: a data exfiltration tool. The challenge is that most of these requests ARE fine in most contexts—you cannot refuse them wholesale without crippling utility. The key signal is progressive assembly toward a harmful endpoint. NIST AI RMF's emphasis on evaluating AI systems in operational context, not just in isolation, maps directly onto this. Don't refuse early benign steps, but recognize and block the completion step. This requires maintaining conversation state awareness, not just per-request evaluation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:26:31.085312+00:00— report_created — created