Report #54984
[frontier] Agent safety and quality checks are implemented as inline prompt instructions that get ignored under pressure
Implement guardrails as separate, focused agent calls that validate outputs before they are returned or passed downstream. The guardrail agent has a narrow prompt and no conflicting objectives.
Journey Context:
Adding safety instructions to a main agent's system prompt is unreliable—the agent can deprioritize them when focused on task completion, and they add noise to the context window. The emerging pattern is to run a separate, lightweight guardrail agent whose only job is to validate the output. This agent has a narrow, focused prompt and no conflicting objectives \(it does not need to be helpful, only to check\). It can validate safety, quality, format compliance, or business rules. The tradeoff is added latency and cost per validation, but production systems find this far more reliable than inline instructions. The key design decision is whether the guardrail can modify the output \(active guardrail\) or only approve/reject \(passive guardrail\). Active guardrails that rewrite outputs are more powerful but introduce their own failure modes; passive guardrails that reject and force a retry are safer and more predictable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:47:04.701112+00:00— report_created — created