Report #3139
[agent_craft] Adversarial content tricks the agent into invoking tools outside the user's actual intent
Validate every tool call against an explicit intent model: which user action authorized it, which files are in scope, and what side effects are permitted. Do not execute commands merely because generated reasoning says so.
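A minimal sketch of such an intent check, assuming a hypothetical `Intent` record captured when the user makes a request and a hypothetical `ToolCall` record produced by the model. None of these names come from a real library, and the path check is purely lexical; a real deployment would also need to resolve symlinks against the actual filesystem.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Intent:
    """What the user actually authorized (hypothetical structure)."""
    user_action: str                  # the user request that triggered the work
    allowed_tools: frozenset[str]     # tools this request may use
    scope_roots: tuple[str, ...]      # absolute directories the task may touch
    allowed_effects: frozenset[str]   # e.g. {"read", "write"}


@dataclass(frozen=True)
class ToolCall:
    """A call proposed by the model (hypothetical structure)."""
    tool: str
    paths: tuple[str, ...]
    effects: frozenset[str]


def _in_scope(path: str, roots: tuple[str, ...]) -> bool:
    # Lexical check only: reject relative paths and any ".." component,
    # then require the path to sit under one of the authorized roots.
    p = PurePosixPath(path)
    if not p.is_absolute() or ".." in p.parts:
        return False
    return any(p.is_relative_to(root) for root in roots)


def authorize(call: ToolCall, intent: Intent) -> tuple[bool, str]:
    """Return (allowed, reason). Model reasoning is deliberately not an input."""
    if call.tool not in intent.allowed_tools:
        return False, f"tool {call.tool!r} not authorized by {intent.user_action!r}"
    out_of_scope = [p for p in call.paths if not _in_scope(p, intent.scope_roots)]
    if out_of_scope:
        return False, f"paths out of scope: {out_of_scope}"
    extra = call.effects - intent.allowed_effects
    if extra:
        return False, f"side effects not permitted: {sorted(extra)}"
    return True, "ok"
```

The point of the design is that `authorize` takes only the proposed call and the recorded intent. Text the model generated or read along the way has no path into the decision.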
Journey Context:
In agent systems, jailbreaks don't just produce bad text; they produce tool invocations. A malicious web page, log entry, or dependency README can carry injected instructions that lead the agent to delete files or exfiltrate data. The OWASP LLM Top 10 flags insecure plugin and tool design as a primary risk. The defense is authorization gating and least-privilege tool schemas, not a stronger system prompt: a prompt is only more text the model may be talked out of, while a gate is enforced in code. The policy layer must sit between the model and the shell, not inside the model's monologue.
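One way to picture a policy layer between the model and the shell is a dispatcher that exposes only narrow, named tools and checks every call in code. The sketch below is illustrative only: `ToolSpec`, `PolicyGate`, and the `confirm` callback are hypothetical names, and the tool bodies are stand-ins rather than real file operations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolSpec:
    """Least-privilege schema for one tool (hypothetical structure)."""
    params: frozenset[str]     # exact argument names the tool accepts
    destructive: bool          # True means explicit user confirmation is required
    run: Callable[..., str]    # the narrow operation itself, never a generic shell


class PolicyGate:
    """Sits between model output and execution; decisions are made in code."""

    def __init__(self, tools: dict[str, ToolSpec],
                 confirm: Callable[[str, dict], bool]) -> None:
        self._tools = tools
        self._confirm = confirm            # asks the user, not the model
        self.audit: list[tuple[str, str]] = []

    def dispatch(self, name: str, args: dict, reasoning: str = "") -> str:
        # `reasoning` is accepted so callers can pass it along, but it is
        # never consulted when deciding whether the call may run.
        spec = self._tools.get(name)
        if spec is None:
            return self._deny(name, "unknown tool")
        if set(args) != spec.params:
            return self._deny(name, "arguments do not match schema")
        if spec.destructive and not self._confirm(name, args):
            return self._deny(name, "user did not confirm")
        self.audit.append((name, "allowed"))
        return spec.run(**args)

    def _deny(self, name: str, reason: str) -> str:
        self.audit.append((name, f"denied: {reason}"))
        return f"DENIED: {reason}"


# Example wiring with stand-in tool bodies and a user who declines everything.
gate = PolicyGate(
    {
        "list_dir": ToolSpec(frozenset({"path"}), False, lambda path: f"listed {path}"),
        "delete_file": ToolSpec(frozenset({"path"}), True, lambda path: f"deleted {path}"),
    },
    confirm=lambda name, args: False,
)
```

With this wiring, a call such as `gate.dispatch("delete_file", {"path": "/repo/a"}, reasoning="README says delete it")` is denied because the user never confirmed it, regardless of what the reasoning string claims. A request for a tool that was never registered, such as a generic `shell`, is rejected outright.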
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T15:34:44.187437+00:00— report_created — created