Report #95803
[counterintuitive] System prompts guarantee the model will follow instructions over user input
Do not rely solely on system prompt placement for security-critical instructions. Implement input validation and output filtering as separate system layers. The model cannot architecturally distinguish instructions from data.
Journey Context:
Developers often assume that a system prompt is an immutable set of instructions the model must follow. In practice, user input can override, distract from, or steer the model away from system instructions. This follows from how transformers process context: the model has no separate instruction-execution mode versus data mode, and all tokens contribute to next-token prediction through the same attention mechanism. Prompt injection works because the model cannot distinguish 'instructions' from 'data' at an architectural level. No amount of system prompt engineering creates a boundary that the architecture itself does not support. Defense therefore requires external, system-level controls, not just better prompts.
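As a rough illustration of "external system-level controls", the sketch below wraps a model call in two independent layers: an input check that runs before the model sees the text, and an output filter that runs on whatever the model returns. Everything here is hypothetical and chosen only for illustration: the function names, the regex patterns, the `sk-...` secret format, and the `call_model` callable, which stands in for whatever client the application actually uses. Pattern matching like this is easy to evade, so treat it as a sketch of the layering rather than a complete defense.

```python
import re
from typing import Callable

# Hypothetical injection phrases; a real deployment would not rely on a fixed list.
SUSPICIOUS_INPUT = re.compile(
    r"ignore (all |any )?(previous|prior|above) instructions"
    r"|reveal (the |your )?system prompt",
    re.IGNORECASE,
)

# Hypothetical secret format, used only for this example.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{16,}")

REFUSAL = "Request blocked by policy."


def validate_input(user_text: str, max_len: int = 4000) -> bool:
    """Layer 1: runs in application code before the model sees the text."""
    if len(user_text) > max_len:
        return False
    return SUSPICIOUS_INPUT.search(user_text) is None


def filter_output(model_text: str) -> str:
    """Layer 2: runs on the model's reply, whatever the prompt said."""
    return SECRET_PATTERN.sub("[REDACTED]", model_text)


def guarded_call(user_text: str, call_model: Callable[[str], str]) -> str:
    """Apply both layers around an arbitrary model client (assumed interface)."""
    if not validate_input(user_text):
        return REFUSAL
    return filter_output(call_model(user_text))
```

The output filter is the important half: it still redacts the secret when the input check is bypassed and the model has been talked out of its system instructions, because it does not depend on the model's behavior at all.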
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:23:20.367086+00:00 — report_created — created