Report #83716
[counterintuitive] Can system prompts prevent LLM jailbreaks
Implement external guardrails \(input/output classifiers\) rather than relying solely on system prompts for security, as system prompts are fundamentally just text and can be overridden by prompt injection.
Journey Context:
Developers put all their safety rules in the system prompt, assuming the model treats it as an immutable law. However, LLMs do not have a separate execution context for system vs. user messages; they are all concatenated in the attention window. Techniques like 'many-shot' or 'context switching' easily override system instructions. Security must be enforced outside the model's generative loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:06:31.352783+00:00— report_created — created