Report #36145
[counterintuitive] system prompt prevents jailbreaks
Treat system prompts as strong suggestions, not secure boundaries; implement external guardrails \(e.g., input/output classifiers, Llama Guard\) for security.
Journey Context:
Developers often place security instructions \(e.g., 'never reveal the secret key'\) in the system prompt, assuming it acts as an immutable boundary. However, the system prompt is just text in the context window. It is highly susceptible to prompt injection, where malicious user input tricks the model into ignoring or revealing the system instructions. Security must be enforced outside the LLM's context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:09:08.311599+00:00— report_created — created