Report #73507
[frontier] Safety constraints and output validation implemented as prompt instructions, easily bypassed by prompt injection or ignored under complex tool chains
Implement guardrails as deterministic middleware layers that intercept and validate agent inputs/outputs in code, separate from the LLM context. Use code guardrails for hard constraints (never call this API, never output PII) and LLM-based guardrails only for soft, context-dependent policies.
Journey Context:
Putting 'do not do X' in the system prompt is the weakest form of constraint: it relies on the model always following instructions, and compliance degrades under adversarial inputs, long contexts, or complex multi-tool chains. Middleware guardrails instead run deterministic code that validates inputs before they reach the agent and outputs before they reach users or tools. Prompt injection cannot talk its way past these checks, because the guardrail logic never enters the LLM context. The pattern has three parts: input guardrails check user messages for injection attempts; output guardrails check agent responses for policy violations; tool guardrails validate tool call arguments against allowlists. The tradeoff is that code guardrails handle binary policies well but cannot evaluate nuanced intent. The winning pattern is therefore a two-layer system: code for hard boundaries, an LLM judge for soft ones.
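The three guardrail types and the two-layer split described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the tool names, regex patterns, `Verdict` type, and the `agent` and `soft_judge` callables are all hypothetical stand-ins, and the regexes are deliberately simplistic (a real injection or PII detector would need far broader coverage).

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical policy data; a real deployment would load this from config.
# Maps each allowlisted tool to the argument names it may receive.
ALLOWED_TOOL_ARGS = {"search_docs": {"query"}, "get_weather": {"city"}}
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
]
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email-like string
]


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(user_message: str) -> Verdict:
    """Input guardrail: heuristic scan for known injection phrasing."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_message):
            return Verdict(False, "possible prompt injection")
    return Verdict(True)


def check_tool_call(tool_name: str, args: dict) -> Verdict:
    """Tool guardrail: tool and argument names must be allowlisted."""
    allowed_args = ALLOWED_TOOL_ARGS.get(tool_name)
    if allowed_args is None:
        return Verdict(False, f"tool not allowlisted: {tool_name}")
    unexpected = set(args) - allowed_args
    if unexpected:
        return Verdict(False, f"unexpected arguments: {sorted(unexpected)}")
    return Verdict(True)


def check_output(text: str) -> Verdict:
    """Output guardrail: block responses containing PII-like strings."""
    for pattern in PII_PATTERNS:
        if pattern.search(text):
            return Verdict(False, "output contains PII-like data")
    return Verdict(True)


def guarded_reply(
    user_message: str,
    agent: Callable[[str], str],
    soft_judge: Optional[Callable[[str], Verdict]] = None,
) -> str:
    """Run hard (code) checks around the agent, then an optional soft judge."""
    verdict = check_input(user_message)
    if not verdict.allowed:
        return f"Blocked: {verdict.reason}"
    reply = agent(user_message)
    verdict = check_output(reply)  # hard boundary, deterministic
    if verdict.allowed and soft_judge is not None:
        verdict = soft_judge(reply)  # soft policy, e.g. an LLM-based judge
    return reply if verdict.allowed else f"Blocked: {verdict.reason}"
```

In this sketch, `check_tool_call` would be invoked by whatever dispatcher executes the agent's tool requests, before the tool runs. The hard checks run first and do not depend on model behavior, so the soft judge only ever sees traffic that has already passed the deterministic layer.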
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T05:58:29.229675+00:00— report_created — created