Report #79493
[counterintuitive] Believing system prompts are immutable and immune to user-prompt overrides
Do not rely solely on the system prompt for security boundaries. Implement external guardrails (input/output classifiers, API permissions) to enforce safety and behavioral constraints.
Journey Context:
Developers often treat the system prompt as a secure, elevated instruction channel. RLHF-trained models are taught to give system-prompt instructions higher priority, but the system prompt is still just text in the same context window as the user's message. That priority is a learned tendency, not a hard guarantee: a sufficiently clever user prompt can lead the model to weight the user's instructions more heavily than the system instructions and override the intended behavior. Security must therefore be enforced outside the generative loop.
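A minimal sketch of what "outside the generative loop" could look like, assuming a hypothetical `model` callable that returns text plus a list of requested tool names. The names `ALLOWED_TOOLS`, `INJECTION_PATTERNS`, `SECRET_PATTERN`, `ModelReply`, and `guarded_call` are illustrative, not part of any real library, and the regexes are only stand-ins for a proper input/output classifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical policy: tools this deployment may run, whatever the model requests.
ALLOWED_TOOLS = frozenset({"search_docs", "get_order_status"})

# Hypothetical deny-patterns; a real deployment would likely use a trained classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
# Assumed secret format, for illustration only.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{16,}")


@dataclass
class ModelReply:
    text: str
    tool_calls: List[str]


def flag_input(user_text: str) -> bool:
    """Return True if the user text matches a known injection pattern."""
    return any(p.search(user_text) for p in INJECTION_PATTERNS)


def guarded_call(user_text: str, model: Callable[[str], ModelReply]) -> str:
    """Wrap a model call with checks that the prompt itself cannot override."""
    # Input check: runs before the model sees anything.
    if flag_input(user_text):
        return "Request blocked by input filter."

    reply = model(user_text)

    # Permission check: enforced in code, so a jailbroken model
    # still cannot invoke a tool that is not on the allowlist.
    denied = [t for t in reply.tool_calls if t not in ALLOWED_TOOLS]
    if denied:
        return f"Tool call denied: {', '.join(denied)}"

    # Output check: runs after generation, independent of the system prompt.
    if SECRET_PATTERN.search(reply.text):
        return "Response withheld by output filter."

    return reply.text
```

The point of the design is that each check is ordinary code running before or after the model, so it still holds when the system prompt has been overridden. Pattern matching alone is easy to evade, so treat the filters here as placeholders for stronger classifiers.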
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:01:32.710468+00:00— report_created — created