Report #27017
[counterintuitive] System prompts reliably constrain model behavior and cannot be overridden by user input
Never rely solely on system prompts for security-critical constraints. Implement defense-in-depth: input validation and sanitization, output filtering, permission systems, and sandboxed execution. Treat system prompts as soft guidance, not hard boundaries.
Journey Context:
System prompts are just text that the model has been trained to prefer following; there is no architectural mechanism that strictly prioritizes system messages over user messages. Prompt injection attacks demonstrate that user input can override system instructions through role-playing, formatting tricks, and social engineering of the model. The model processes all text through the same attention mechanism; 'system' vs. 'user' is a convention, not an enforcement boundary. For any security-relevant constraint ('never execute destructive commands', 'never reveal internal data', 'only output valid JSON'), you need enforcement outside the model: input sanitization, output parsing with rejection of malformed responses, permission systems that gate dangerous actions, and sandboxed execution environments that limit blast radius.
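The enforcement-outside-the-model idea can be sketched in a few lines of Python. This is a minimal illustration, not a complete defense: the names `ALLOWED_ACTIONS`, `ToolCall`, `RejectedOutput`, `parse_tool_call`, `authorize`, and `handle_model_output`, as well as the `{"action", "path"}` output shape, are all hypothetical and chosen only for this example. It covers two of the layers named above: output parsing that rejects malformed responses, and a permission check that gates actions regardless of what the prompt said.

```python
import json
from dataclasses import dataclass

# Hypothetical allowlist: the only actions this application will ever run,
# no matter what the system prompt or the user input asked for.
ALLOWED_ACTIONS = {"read_file", "list_dir"}


class RejectedOutput(Exception):
    """Raised when model output fails a check enforced outside the model."""


@dataclass(frozen=True)
class ToolCall:
    action: str
    path: str


def parse_tool_call(raw_output: str) -> ToolCall:
    """Treat model output as untrusted text and reject anything malformed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise RejectedOutput(f"not valid JSON: {exc}") from exc
    # Assumed schema: an object with exactly the keys 'action' and 'path'.
    if not isinstance(data, dict) or set(data) != {"action", "path"}:
        raise RejectedOutput("expected exactly the keys 'action' and 'path'")
    if not all(isinstance(data[key], str) for key in ("action", "path")):
        raise RejectedOutput("'action' and 'path' must be strings")
    return ToolCall(action=data["action"], path=data["path"])


def authorize(call: ToolCall) -> ToolCall:
    """Permission gate: the code, not the prompt, decides what may run."""
    if call.action not in ALLOWED_ACTIONS:
        raise RejectedOutput(f"action {call.action!r} is not permitted")
    return call


def handle_model_output(raw_output: str) -> ToolCall:
    """Run every check; only a call that passes all of them is returned."""
    return authorize(parse_tool_call(raw_output))
```

With this in place, a response such as `{"action": "delete_all", "path": "/"}` raises `RejectedOutput` even if an injected instruction convinced the model to produce it. The sketch does not validate `path` or sandbox execution; those layers would still be needed to limit blast radius.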
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:44:52.035416+00:00— report_created — created