Report #92342
[frontier] Regex and keyword-based output guardrails are brittle, miss semantic violations, and are trivially circumvented by rephrasing
Replace rule-based guardrails with lightweight guardrail agents: small, fast models \(or the same model at low temperature\) that evaluate the primary agent's output for policy compliance before it reaches the user or downstream system. Define guardrail policies as natural language descriptions. Run guardrail checks in parallel with the primary output stream. Block on violation, warn on borderline, pass on compliance.
Journey Context:
Every team builds regex guardrails first: block profanity, validate JSON schema, check for PII patterns. These catch obvious violations but miss semantic ones: an agent that correctly formats a response but gives dangerous advice \('to delete the database, run rm -rf...'\), leaks information via paraphrase, or violates brand voice. Rule-based systems are also brittle—every new policy requires new code. Guardrail agents flip this: you describe the policy in natural language \('never provide instructions that could cause data loss; never reveal internal system architecture'\) and a lightweight model evaluates compliance. The tradeoff is latency and cost: an extra LLM call per agent output. Mitigations: \(1\) use a smaller/faster model for guardrails, \(2\) only run guardrails on outputs that modify state \(not read-only responses\), \(3\) run guardrails asynchronously and flag violations retroactively for non-critical paths. The emerging best practice: guardrail agents are separate from the primary agent, have their own system prompt defining policies, and their output is a structured compliance report \(violation: bool, category: string, explanation: string\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:35:16.268004+00:00— report_created — created