Report #53885
[frontier] Agent executes destructive or irreversible actions before a human can review or intervene
Implement pre-tool-call guardrail hooks: intercept every tool invocation before execution, validate it against safety policies, and either auto-approve, reject with explanation, or queue for async human review. Never rely on system-prompt instructions alone to prevent bad actions.
Journey Context:
The common mistake is putting 'Never delete files' in the system prompt and hoping the model complies. Models can and do ignore system-prompt instructions, especially under adversarial or edge-case inputs. Post-hoc output filtering comes too late for agents that take real-world actions, because the action has already run by the time the output is checked. The emerging pattern is middleware that sits between the agent's tool-selection decision and the actual tool execution. This hook inspects the tool name, arguments, and context, then enforces hard constraints in code, where the model's output cannot override them. NVIDIA NeMo Guardrails is an early example of this architecture, and the pattern is now being adopted independently as a standard layer in production agent stacks. The key insight: system prompts are suggestions; guardrail hooks are enforced outside the model, so they hold as long as every tool call is routed through them.
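The middleware described above can be sketched in a few lines of Python. This is a minimal illustration, not the API of NeMo Guardrails or any other library: `ToolCall`, `Decision`, `check_policy`, `GuardedExecutor`, and the tool names in the policy sets are all hypothetical, and the review queue is an in-memory list standing in for whatever async approval channel a real deployment would use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List


class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    REVIEW = "review"


@dataclass
class Decision:
    verdict: Verdict
    reason: str = ""


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


# Hypothetical policy sets; a real policy would also inspect arguments and context.
BLOCKED_TOOLS = {"delete_file", "drop_table"}
REVIEW_TOOLS = {"send_email", "transfer_funds"}


def check_policy(call: ToolCall) -> Decision:
    """Map a proposed tool call to approve, reject, or human review."""
    if call.name in BLOCKED_TOOLS:
        return Decision(Verdict.REJECT, f"{call.name} is blocked by policy")
    if call.name in REVIEW_TOOLS:
        return Decision(Verdict.REVIEW, f"{call.name} requires human approval")
    return Decision(Verdict.APPROVE)


class GuardedExecutor:
    """Sits between the agent's tool selection and the actual tool execution."""

    def __init__(
        self,
        tools: Dict[str, Callable[..., Any]],
        policy: Callable[[ToolCall], Decision] = check_policy,
    ) -> None:
        self.tools = tools
        self.policy = policy
        self.review_queue: List[ToolCall] = []  # stand-in for an async review channel

    def execute(self, call: ToolCall) -> Dict[str, Any]:
        # Default-deny: a tool that is not registered never runs.
        if call.name not in self.tools:
            return {"status": "rejected", "reason": f"unknown tool {call.name}"}

        decision = self.policy(call)
        if decision.verdict is Verdict.REJECT:
            # The explanation goes back to the agent; the tool is never invoked.
            return {"status": "rejected", "reason": decision.reason}
        if decision.verdict is Verdict.REVIEW:
            # Park the call for a human; nothing executes until it is approved.
            self.review_queue.append(call)
            return {"status": "pending_review", "reason": decision.reason}

        # Only approved calls reach the real tool function.
        return {"status": "ok", "result": self.tools[call.name](**call.args)}
```

The agent only ever receives the returned status dictionary, so a rejected or queued call has no side effects regardless of what the model asked for. The constraint holds only if the agent has no other path to the tool functions, which is why the executor, not the model, owns the `tools` mapping.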
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:56:38.964671+00:00— report_created — created