Report #51161
[frontier] Post-hoc output filtering catches violations too late and blocks legitimate agent actions unpredictably
Implement guardrails as tools the agent must call before taking consequential actions, not as post-hoc filters. Define a validate_action or request_permission tool that the agent calls before executing sensitive operations. The guardrail tool returns approval or rejection with reasoning, which the agent incorporates naturally into its next step.
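A minimal sketch of what such a guardrail tool could look like, in Python. Everything here is an illustrative assumption rather than part of the report: the action names, the blocked-domain rule, the Verdict type, and the tool schema (written in the generic JSON-Schema style many tool-calling APIs accept; adapt it to your framework).

```python
from dataclasses import dataclass

# Hypothetical policy data: these names and rules are assumptions for illustration.
SENSITIVE_ACTIONS = {"send_email", "delete_file", "make_payment"}
BLOCKED_RECIPIENT_DOMAINS = {"blocked.example"}


@dataclass(frozen=True)
class Verdict:
    approved: bool
    reasoning: str  # returned to the agent so it can adjust its next step


def validate_action(action: str, params: dict) -> Verdict:
    """Guardrail tool the agent calls before a consequential action."""
    if action not in SENSITIVE_ACTIONS:
        return Verdict(True, f"'{action}' is not classified as sensitive.")
    if action == "send_email":
        recipient = str(params.get("to", ""))
        if "@" not in recipient:
            return Verdict(False, "Missing or malformed recipient address.")
        domain = recipient.rpartition("@")[2].lower()
        if domain in BLOCKED_RECIPIENT_DOMAINS:
            return Verdict(False, f"Recipient domain '{domain}' is blocked by policy.")
        return Verdict(True, "Recipient domain is allowed.")
    # Default-deny: sensitive actions without an explicit rule are rejected.
    return Verdict(False, f"No approval rule defined for '{action}'; escalate to a human.")


# Tool description handed to the model (schema shape is an assumption).
VALIDATE_ACTION_TOOL = {
    "name": "validate_action",
    "description": "Call before any consequential action. "
                   "Returns approval or rejection with reasoning.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string"},
            "params": {"type": "object"},
        },
        "required": ["action", "params"],
    },
}

if __name__ == "__main__":
    print(validate_action("send_email", {"to": "alice@blocked.example"}))
    print(validate_action("send_email", {"to": "bob@partner.example"}))
```

The rejection carries a reason string rather than a bare boolean so that the agent has something concrete to act on, such as choosing a different recipient or asking the user.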
Journey Context:
The common approach is to run the agent's output through a safety filter after generation. This has three problems: (1) it wastes the LLM call if the output is rejected, (2) it creates an adversarial dynamic in which the agent, seeing only a rejection, tends to rephrase and retry until something gets past the filter, and (3) it cannot handle actions that have already been executed (you cannot un-send an email). The guardrail-as-tool pattern makes safety a collaborative part of the agent's workflow: the agent checks before acting, much as a person asks for permission before a risky action. Tradeoff: each sensitive action costs an extra round-trip through the guardrail tool, but this prevents irreversible mistakes and creates an auditable trail of safety decisions. The key insight is that tool use is the most reliable way to control LLM behavior: agents follow tool-calling conventions much more reliably than they follow instructions about what not to do.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:21:49.137967+00:00— report_created — created