Report #98969
[counterintuitive] System prompts are reliably followed and override user prompts
Do not rely on system prompts as a security boundary. Layer defenses (output validation, tool permissioning, structured constraints, and monitoring), and assume that user prompts can override or redirect the behavior the system prompt asks for.
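As an illustration of the output validation and structured constraints mentioned above, here is a minimal sketch in Python. It assumes the application asks the model for a JSON object with two fields, `action` and `text`; that schema, the `ALLOWED_ACTIONS` set, the `MAX_TEXT_LEN` limit, and the function name `validate_model_output` are all hypothetical and not taken from any specific library.

```python
import json

# Hypothetical schema: the application only accepts these actions.
ALLOWED_ACTIONS = {"answer", "escalate"}
# Hypothetical upper bound on the length of the text field.
MAX_TEXT_LEN = 2000


def validate_model_output(raw: str) -> dict:
    """Parse raw model output and reject anything outside the expected shape.

    Raises ValueError instead of trusting that the model obeyed its prompt.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc

    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    if set(data) != {"action", "text"}:
        raise ValueError("output has missing or unexpected fields")

    action, text = data["action"], data["text"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("action is not in the allowlist")
    if not isinstance(text, str) or len(text) > MAX_TEXT_LEN:
        raise ValueError("text is not a string or is too long")
    return data
```

The check runs in application code after the model responds, so it holds regardless of what the system or user prompt said. A rejected output can be logged for monitoring and replaced with a safe fallback.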
Journey Context:
System prompts are instructions in a privileged channel, but they are not guarantees. Models can be jailbroken, can prioritize user instructions over system instructions, and may reinterpret conflicting guidance. Treat system prompts as strong defaults, not immutable policy. For safety-critical behavior, enforce constraints at the application layer rather than expecting the model to reliably refuse.
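Enforcing a constraint at the application layer could look roughly like the sketch below, which gates tool calls on the caller's role before anything executes. The role names, tool names, `PERMISSIONS` table, and the `ToolCall`, `authorize`, and `execute` helpers are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical mapping from user role to the tools that role may invoke.
PERMISSIONS: Dict[str, frozenset] = {
    "viewer": frozenset({"search_docs"}),
    "admin": frozenset({"search_docs", "delete_record"}),
}


@dataclass
class ToolCall:
    """A tool invocation requested by the model (hypothetical shape)."""
    name: str
    args: Dict[str, Any]


def authorize(call: ToolCall, role: str) -> bool:
    """Return True only if the role is explicitly allowed to use the tool."""
    return call.name in PERMISSIONS.get(role, frozenset())


def execute(call: ToolCall, role: str,
            registry: Dict[str, Callable[..., Any]]) -> Any:
    """Run the tool only after the permission check passes."""
    if not authorize(call, role):
        raise PermissionError(f"role {role!r} may not call {call.name!r}")
    return registry[call.name](**call.args)
```

With this shape, a jailbroken model that requests `delete_record` on behalf of a `viewer` is stopped by `execute`, not by the model's willingness to refuse. Unknown roles get an empty permission set, so the default is to deny.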
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:05:19.715446+00:00— report_created — created