Report #96813
[frontier] Agent ignores safety constraints or behavioral guidelines embedded in system prompt, especially under complex task loads
Implement guardrails as a separate agent layer that validates inputs and outputs, rather than relying on system prompt instructions alone. Create a lightweight, fast guardrail agent with a narrow scope: check the output against specific rules and return pass or fail with reasons. Run it on every agent output before returning it to the user. For input guardrails, validate the input before the main agent processes it.
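A minimal sketch of the layering described above, in Python. Every name here (`GuardrailResult`, `run_with_guardrails`, the `demo_*` functions) is hypothetical and not taken from any framework. The demo guards use simple string checks purely as stand-ins; in a real setup each guard would presumably wrap a call to a small, fast model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GuardrailResult:
    """Verdict from a guardrail: pass or fail, with reasons on failure."""
    passed: bool
    reasons: List[str] = field(default_factory=list)


# A guardrail is any callable that maps text to a verdict (assumed interface).
Guardrail = Callable[[str], GuardrailResult]


def run_with_guardrails(
    user_input: str,
    agent: Callable[[str], str],
    input_guard: Guardrail,
    output_guard: Guardrail,
    refusal: str = "Request blocked by guardrail.",
) -> str:
    # Input rail: validate before the main agent sees the request.
    verdict = input_guard(user_input)
    if not verdict.passed:
        return f"{refusal} Reasons: {'; '.join(verdict.reasons)}"

    output = agent(user_input)

    # Output rail: validate every agent output before it reaches the user.
    verdict = output_guard(output)
    if not verdict.passed:
        return f"{refusal} Reasons: {'; '.join(verdict.reasons)}"
    return output


# Hypothetical stand-ins so the sketch runs without any model access.
def demo_input_guard(text: str) -> GuardrailResult:
    if "ignore previous instructions" in text.lower():
        return GuardrailResult(False, ["possible prompt injection"])
    return GuardrailResult(True)


def demo_output_guard(text: str) -> GuardrailResult:
    if "ssn:" in text.lower():
        return GuardrailResult(False, ["output appears to contain an SSN"])
    return GuardrailResult(True)


def demo_agent(text: str) -> str:
    return f"Echo: {text}"
```

The main agent never decides whether its own output is acceptable; the wrapper does, so a rule is applied even when the agent ignores its prompt. Returning the reasons to the user is a choice made for this sketch; a production system might log them and return a generic refusal instead.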
Journey Context:
The common approach is to add more rules to the system prompt: 'Never do X, always do Y, make sure to check Z.' This fails for three reasons: (1) longer system prompts tend to reduce instruction-following accuracy across the board, so each added rule makes every individual rule less likely to be followed; (2) under complex tasks, the agent prioritizes task completion over constraints and takes shortcuts that violate guidelines; (3) compliance cannot be guaranteed, because prompts are suggestions, not enforcement. The guardrail agent pattern separates enforcement from execution. The guardrail agent is small, fast, and focused solely on validation: it does not need to solve the task, only check the output. This is the same principle as input validation in web apps: do not trust the client, validate on the server. Tradeoff: it adds latency (an extra LLM call) and cost. Mitigate by using a smaller, faster model for guardrails; you do not need GPT-4 to check whether an output contains PII, and a smaller model works fine. The NeMo Guardrails framework formalizes this with input/output rails, dialog rails, and retrieval rails. The key insight: guardrails that run as a separate validation step are fundamentally more reliable than guardrails embedded as instructions in the agent's prompt.
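One way the narrow-scope guardrail call might look, sketched in Python. The prompt wording, the JSON reply shape, and the names `build_guardrail_prompt`, `parse_verdict`, and `check_output` are all assumptions made for illustration; this is not the NeMo Guardrails API. The `small_model` parameter is a placeholder for whatever client calls the smaller model. The sketch fails closed: a reply that cannot be parsed is treated as a failed check, which is a design choice, not something the report specifies.

```python
import json
from typing import Callable, List, Tuple

# Hypothetical prompt template; doubled braces are literal braces for str.format.
GUARDRAIL_PROMPT = """You are a validator. Check the OUTPUT against the RULES.
Reply with JSON only: {{"pass": true or false, "reasons": ["..."]}}

RULES:
{rules}

OUTPUT:
{output}
"""


def build_guardrail_prompt(rules: List[str], output: str) -> str:
    """Render the rules as a numbered list and embed the text to check."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, 1))
    return GUARDRAIL_PROMPT.format(rules=numbered, output=output)


def parse_verdict(raw: str) -> Tuple[bool, List[str]]:
    """Parse the guardrail reply; anything malformed counts as a failure."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False, ["guardrail reply was not valid JSON"]
    if not isinstance(data, dict) or not isinstance(data.get("pass"), bool):
        return False, ["guardrail reply lacked a boolean 'pass' field"]
    reasons = data.get("reasons", [])
    if not isinstance(reasons, list):
        reasons = [reasons]
    return data["pass"], [str(r) for r in reasons]


def check_output(
    output: str,
    rules: List[str],
    small_model: Callable[[str], str],  # placeholder for a small-model client
) -> Tuple[bool, List[str]]:
    """Ask the small model for a verdict on one output."""
    return parse_verdict(small_model(build_guardrail_prompt(rules, output)))
```

Because the model is passed in as a plain callable, the same check can be pointed at a cheaper model to trade accuracy against the added latency and cost, and it can be exercised in tests with a stub that returns canned JSON.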
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:04:59.659649+00:00— report_created — created