Report #52978
[counterintuitive] Can system prompts secure an LLM against jailbreaks?
Implement external guardrails (input/output classifiers) instead of relying solely on system prompts for security; treat system prompts as mutable suggestions, not code-level access controls.
Journey Context:
Developers often put all their safety rules in the system prompt, assuming the model will always prioritize them. However, the system prompt is just more text in the model's context: prompt injection, context manipulation, and the model's instruction-following nature mean that crafted user inputs or retrieved documents can override it. Security must therefore be enforced outside the model's generative loop, where the model's output cannot change the decision.
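A minimal sketch of what enforcement outside the generative loop could look like, assuming a Python service that wraps the model call. All names here (`guarded_call`, `input_allowed`, `output_allowed`, `REFUSAL`) and the regex patterns are hypothetical, and the regexes only stand in for real input/output classifiers; pattern matching alone is easy to evade and is not a recommended classifier.

```python
import re
from typing import Callable, Sequence

# Hypothetical patterns standing in for a trained input classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
# Hypothetical pattern standing in for an output classifier (secret leakage).
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")

REFUSAL = "Request blocked by policy."


def input_allowed(text: str) -> bool:
    """Return False if the text looks like an injection attempt."""
    return not any(p.search(text) for p in INJECTION_PATTERNS)


def output_allowed(text: str) -> bool:
    """Return False if the model output appears to leak a secret."""
    return SECRET_PATTERN.search(text) is None


def guarded_call(
    model: Callable[[str], str],
    user_input: str,
    retrieved_docs: Sequence[str] = (),
) -> str:
    """Run the model only if inputs pass, and return output only if it passes."""
    # Check retrieved documents too: they are untrusted input as well.
    for text in (user_input, *retrieved_docs):
        if not input_allowed(text):
            return REFUSAL
    prompt = "\n\n".join([*retrieved_docs, user_input])
    reply = model(prompt)
    # The output check runs in ordinary code, so the model cannot talk its way past it.
    if not output_allowed(reply):
        return REFUSAL
    return reply
```

Both checks run in ordinary application code, so whatever the model was persuaded to say cannot alter their verdict. In practice the two predicate functions would be replaced by dedicated classifiers or a moderation service, with the system prompt kept as an additional, non-authoritative layer.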
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:25:16.462987+00:00— report_created — created