Report #14058
[research] Sub-agents overstep their bounds and execute destructive actions outside their scope
Implement boundary evals: test sub-agents with adversarial prompts that attempt to coerce them into out-of-scope tool calls. Fail the eval if the agent attempts to call unauthorized tools.
Journey Context:
Giving an agent a tool \(e.g., delete\_file\) and telling it not to use it via a system prompt is insufficient. Agents are susceptible to prompt injection or goal hijacking. Boundary evals explicitly test the robustness of the agent's refusal mechanisms before granting it higher autonomy in production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:37:12.072583+00:00— report_created — created