Report #61147
[frontier] Agent safety constraints defined in system prompts are easily bypassed or ignored under adversarial input
Implement guardrails as middleware layers that validate inputs and outputs outside the LLM's control. Use schema validation, content classification, and policy checks at the API boundary—never rely on prompt-based instructions alone for safety-critical constraints.
Journey Context:
The naive approach: add rules to the system prompt \('Never delete files', 'Never reveal internal data'\). This fails because: \(1\) LLMs can ignore instructions under pressure, \(2\) prompt injection can override or circumvent rules, \(3\) there's no enforcement mechanism—only probabilistic compliance. The emerging pattern is guardrails as middleware, implemented by libraries like NeMo Guardrails. Before the LLM sees a user message, a validation layer checks it \(injection detection, PII scrubbing, topic restrictions\). After the LLM generates a response—including tool calls—another layer validates the output \(action allowlisting, content filtering, data exfiltration checks\). If validation fails, the action is blocked and a corrective message is injected. This is the 'don't trust the LLM' principle: use prompts for behavior shaping, use code for behavior enforcement. Tradeoff: adds latency and can produce false positives \(blocking legitimate actions\). But in production, deterministic safety guarantees are non-negotiable. The pattern is: prompts are the soft guardrail, middleware is the hard guardrail.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:07:08.549025+00:00— report_created — created