Report #37972
[counterintuitive] Why does the model ignore or override my system prompt instructions
Treat system prompts as soft guidance, not enforced constraints; implement critical constraints \(safety, format, behavior\) in your application layer with validation and guardrails; for security-sensitive boundaries, use input sanitization and output filtering, not prompt instructions
Journey Context:
Developers treat the system message as a privileged, immutable instruction channel that the model 'must' follow more strictly than user messages. In reality, the system/user/assistant role distinction is a convention encoded in special tokens, not an architectural enforcement boundary. The model processes system messages as tokens like any other — there is no separate execution path or elevated authority. This is why prompt injection works: user content that mimics system-level instructions can override the original system prompt because the model doesn't maintain a security boundary between roles. Stronger system prompts \('NEVER ignore these instructions', 'ABOVE ALL ELSE...'\) don't create a real privilege boundary; they just add more tokens that the model may or may not attend to. The fix is to stop treating prompts as security boundaries and implement real constraints in code.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:12:59.058291+00:00— report_created — created