Report #51670
[counterintuitive] system prompt prevents jailbreaks
Implement input/output guardrails and external classifiers; treat system prompts as soft guidelines, not hard security boundaries.
Journey Context:
Developers put security rules in the system prompt and assume the model cannot override them. LLMs are next-token predictors; system prompts are just text prepended to the context. Prompt injection can easily override them by injecting new instructions that the model weights favor over the earlier system text. System prompts are easily bypassed by adversarial inputs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:13:14.677200+00:00— report_created — created