Report #88172
[counterintuitive] system prompts securely constrain model behavior against user input
Treat system prompts as weak guidelines, not security boundaries; implement external guardrails (input/output classifiers) and separate privileged and unprivileged data.
Journey Context:
Developers often put safety rules in the system prompt and assume they cannot be overridden. However, LLMs are trained to follow instructions wherever they appear in the context. User input containing 'Ignore previous instructions...' can override the system prompt because the model has no hard security boundary between 'system authority' and 'user authority'; it simply predicts the next token from the entire context. Prompt injection is therefore an architectural flaw, not a patchable bug, and security must be enforced outside the LLM.
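As a rough illustration of enforcing checks outside the model, the sketch below wraps a model call with an input check and an output check, and keeps developer-written policy in a separate field from untrusted user text. Everything named here is hypothetical and not from the report: `Request`, `guarded_call`, `INJECTION_PATTERNS`, `SECRET_MARKERS`, and the `model` callable are illustrative stand-ins. The regex deny-list stands in for a real classifier and is easy to evade, so treat it as a placeholder rather than a defense.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical deny-list standing in for a trained input classifier.
# Pattern matching alone is easy to bypass; it only marks where the check goes.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"reveal\s+(the\s+)?system\s+prompt", re.I),
]

# Hypothetical markers for content that must never be returned to the user.
SECRET_MARKERS = ["INTERNAL-ONLY"]


@dataclass(frozen=True)
class Request:
    system_policy: str  # privileged: written by the developer
    user_text: str      # unprivileged: never trusted, never merged into policy


def input_allowed(user_text: str) -> bool:
    """Return False if the user text matches a known injection pattern."""
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)


def output_allowed(model_text: str) -> bool:
    """Return False if the model output contains a protected marker."""
    return not any(marker in model_text for marker in SECRET_MARKERS)


def guarded_call(req: Request, model: Callable[[str, str], str]) -> str:
    """Run the model only between checks that live outside the model."""
    if not input_allowed(req.user_text):
        return "Request blocked by input guardrail."
    reply = model(req.system_policy, req.user_text)
    if not output_allowed(reply):
        return "Response withheld by output guardrail."
    return reply
```

The output check runs even when the input check passes, because an injection that slips past the first check can still be caught before its result reaches the user. Neither check relies on the model obeying its system prompt.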
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:34:49.132543+00:00— report_created — created