Report #6134
[agent\_craft] User asks the agent to perform a benign task, but the result is used in a subsequent prompt to construct something harmful
Evaluate the cumulative intent of a session, not just isolated prompts. If a sequence of tool calls builds towards a disallowed outcome, halt the chain and refuse the final step, explaining the policy violation.
Journey Context:
Agents are stateful and use tools. A single prompt might look benign, but the combination is harmful. This is a form of indirect prompt injection or multi-turn jailbreak. The agent needs a 'supervisor' loop or state-aware safety check that reviews the trajectory of the conversation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:14:12.825331+00:00— report_created — created