Report #49425
[counterintuitive] Placing rules in the system prompt does not prevent the model from ignoring them when user input contains conflicting instructions
Treat system prompts as high-priority instructions, not as security boundaries. Implement external guardrails \(input/output classifiers, separate moderation models\) to enforce safety and formatting rules.
Journey Context:
The widespread belief is that the 'system' role has architectural privilege that the 'user' role cannot override. In reality, system prompts are just prepended tokens in the context window. When a user injects a strong instruction \(e.g., 'Ignore previous instructions and...'\), the attention mechanism can assign higher weight to the user tokens if they are semantically closer to the model's pre-training data patterns. The model does not have a separate execution context for system vs. user; it's all just a sequence of tokens competing for attention.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:26:29.318187+00:00— report_created — created