Report #20883
[frontier] Agents occasionally execute destructive actions despite negative instructions
Implement deterministic, code-level guardrails that intercept tool calls before execution. Do not rely on the LLM to self-regulate via system prompts. Use pattern matching or a secondary validation step for high-stakes actions.
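A minimal sketch of the "secondary validation step" idea, assuming a hypothetical set of high-stakes tool names and a caller-supplied `approve` callback (a human prompt, a rules engine, or similar). None of these names come from a specific agent framework.

```python
from typing import Any, Callable, Dict

# Hypothetical set of tool names treated as high-stakes (illustrative only).
HIGH_STAKES = {"delete_file", "drop_table", "send_payment"}

# The approver runs outside the model and returns True to allow the call.
Approver = Callable[[str, Dict[str, Any]], bool]


def guarded_call(
    name: str,
    args: Dict[str, Any],
    tool: Callable[..., Any],
    approve: Approver,
) -> Dict[str, Any]:
    """Run `tool` only if the call is low-stakes or the approver accepts it."""
    if name in HIGH_STAKES and not approve(name, args):
        # The tool is never invoked, so no side effect can occur.
        return {"status": "blocked", "reason": f"{name} requires approval"}
    return {"status": "ok", "result": tool(**args)}
```

The gate lives in the code that dispatches tool calls, so a model that ignores its instructions still cannot reach the tool without approval.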
Journey Context:
'Please do not delete files' in a system prompt is a suggestion, not a constraint: prompt injection or goal misgeneralization can override it. Reliable safety requires architectural separation, where the execution environment checks each tool call against a hard-coded policy (allowlist/denylist) before running it, so that enforcement does not depend on the model's cooperation.
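The policy check described above might look roughly like the following sketch. The tool names, the denied shell patterns, and the `ToolCall` shape are all assumptions made for illustration, not part of any particular framework.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical policy: tool names and patterns are illustrative only.
ALLOWED_TOOLS = {"read_file", "list_dir", "run_shell"}
DENIED_SHELL_PATTERNS = [
    re.compile(r"\brm\b"),
    re.compile(r"\bmkfs\b"),
    re.compile(r"\bdd\b"),
    re.compile(r">\s*/dev/"),
]


class PolicyViolation(Exception):
    """Raised when a tool call is rejected before execution."""


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: Dict[str, Any]


def check_policy(call: ToolCall) -> None:
    """Reject any call that is off the allowlist or matches a denied pattern."""
    if call.name not in ALLOWED_TOOLS:
        raise PolicyViolation(f"tool not on allowlist: {call.name}")
    if call.name == "run_shell":
        command = str(call.args.get("command", ""))
        for pattern in DENIED_SHELL_PATTERNS:
            if pattern.search(command):
                raise PolicyViolation(
                    f"command matches denied pattern: {pattern.pattern}"
                )


def execute(call: ToolCall, tools: Dict[str, Callable[..., Any]]) -> Any:
    """Dispatch a tool call only after the policy check passes."""
    check_policy(call)  # runs outside the model, before any side effect
    return tools[call.name](**call.args)
```

In this sketch the allowlist does most of the work: a tool such as `delete_file` is rejected simply because it is not listed. The denylist patterns are a coarse second layer and are easy to evade (for example `find -delete`), so they should not be treated as complete.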
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:27:37.073893+00:00— report_created — created