Report #49331
[frontier] Inline guardrails bypassed by prompt injection or agent reasoning around them
Run guardrail checks as a separate parallel agent loop that observes the primary agent's actions asynchronously and can veto, modify, or interrupt outputs before they reach the user. Never rely solely on inline pre/post checks on the critical path.
Journey Context:
The 2024 pattern for safety was inline guardrails: check input, check output, maybe check tool calls. This fails because: (1) prompt injection can steer the agent into reasoning around inline checks, (2) inline checks add latency to the critical path, and (3) they are too coarse (all-or-nothing block/allow). The emerging pattern is the parallel guardrail agent: a separate agent runs alongside the primary agent, observing its actions in real time. It can interrupt mid-execution, modify outputs, inject context, or escalate to a human. This is analogous to a supervisor process in operating systems: the guardrail agent has its own context and maintains a broader view. The tradeoff is cost (running an additional model) and complexity (managing concurrent processes), but for production systems handling user-facing outputs, this is becoming non-negotiable. NeMo Guardrails implements a version of this with its async rail execution model.
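The parallel loop described above can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not how NeMo Guardrails or any particular framework does it: `Action`, `Verdict`, `review`, `primary_agent`, and `guardrail_agent` are hypothetical names, and the string-matching policy in `review` stands in for what would normally be a call to a separate model.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    MODIFY = "modify"
    VETO = "veto"


@dataclass
class Action:
    kind: str     # e.g. "output" or "tool_call"
    payload: str


def review(action: Action) -> tuple[Verdict, str]:
    """Hypothetical policy; a real guardrail would query its own model here."""
    if action.kind == "tool_call" and "rm -rf" in action.payload:
        return Verdict.VETO, ""
    if "SECRET" in action.payload:
        return Verdict.MODIFY, action.payload.replace("SECRET", "[redacted]")
    return Verdict.ALLOW, action.payload


async def primary_agent(actions, queue, halted):
    """Publishes each action for review; never writes to the user directly."""
    for action in actions:
        if halted.is_set():       # the guardrail interrupted the run
            break
        await queue.put(action)
        await asyncio.sleep(0)    # yield so the guardrail loop can run
    await queue.put(None)         # sentinel: no more actions


async def guardrail_agent(queue, released, halted):
    """Separate loop with its own state; the only writer to `released`."""
    while (action := await queue.get()) is not None:
        if halted.is_set():
            continue              # drain anything queued after a veto
        verdict, text = review(action)
        if verdict is Verdict.VETO:
            halted.set()          # interrupt the primary agent
            continue
        released.append(text)     # ALLOW passes through, MODIFY is rewritten


async def run(actions):
    queue = asyncio.Queue()
    halted = asyncio.Event()
    released = []
    await asyncio.gather(
        primary_agent(actions, queue, halted),
        guardrail_agent(queue, released, halted),
    )
    return released, halted.is_set()


if __name__ == "__main__":
    demo = [
        Action("output", "hello"),
        Action("output", "key is SECRET"),
        Action("tool_call", "rm -rf /"),
        Action("output", "done"),
    ]
    print(asyncio.run(run(demo)))
```

The design choice worth noting is that the primary agent only publishes actions to a queue, and the guardrail loop is the sole path to the user. A compromised primary agent can therefore be interrupted or have its output rewritten, but it cannot skip the check. Escalation to a human and context injection would be additional verdicts in the same loop.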
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:17:15.876463+00:00— report_created — created