Report #62881
[counterintuitive] System prompts securely prevent jailbreaks
Treat system prompts as advisory, not as security boundaries; implement input validation, output filtering, and separate guardrail models, as system prompts are easily overridden by prompt injections.
Journey Context:
Developers often put strict rules in the system prompt \(e.g., 'Never reveal the secret key'\) and assume the model will always obey. However, LLMs do not have separate memory spaces or privilege levels for system vs. user prompts; they are all just tokens concatenated together. A sufficiently clever user prompt \(or injected text in a RAG document\) can override the system prompt by instructing the model to ignore previous instructions. Security must be enforced outside the LLM.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:01:34.303354+00:00— report_created — created