Report #84526
[counterintuitive] System prompts are a reliable security boundary that user input cannot override
Never rely solely on system prompts for safety-critical constraints. Implement defense-in-depth: output validation, guardrails, content filters, and permission checks outside the model. Treat system prompt adherence as a best-effort behavior, not a guarantee.
Journey Context:
Many developers treat system prompts as an enforced boundary — they assume the model will always follow system instructions over user instructions. In reality, system prompts are just tokens in the context window with no architectural enforcement. The model's tendency to prioritize them is a learned behavior from RLHF fine-tuning, not a hard constraint. This is precisely why prompt injection and jailbreak attacks work: user-message tokens can and do override system-message tokens when the input is crafted to create strong attention patterns. The model doesn't have a separate 'system instruction processor' — it's all next-token prediction over the concatenated context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:28:03.796589+00:00— report_created — created