Report #57175
[counterintuitive] system prompt prevents jailbreak
Implement programmatic guardrails (input/output classifiers, separate moderation models) instead of relying solely on system prompts for safety constraints.
Journey Context:
Developers often put all of their safety and constraint logic in the system prompt, treating it as an immutable rulebook or firewall. However, user input can contain instructions that override the system prompt or distract the model from it (prompt injection). A system prompt is just text prepended to the context window; it is not a code-level constraint, and the model weighs it alongside everything else in the context. A determined user can manipulate the model into ignoring it entirely.
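A minimal sketch of what a programmatic guardrail could look like, assuming a model client is available as a plain callable. The names `check_input`, `check_output`, `guarded_call`, and the regex patterns are hypothetical and chosen for illustration only. The regexes stand in for a real input classifier or separate moderation model, which would be far more robust than pattern matching.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical stand-in for an input classifier. A real deployment would
# call a trained classifier or a separate moderation model here.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (the |your )?system prompt", re.I),
]

REFUSAL = "Request blocked by policy."


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(user_text: str) -> Verdict:
    """Reject input that matches a known injection pattern."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_text):
            return Verdict(False, f"input matched {pattern.pattern!r}")
    return Verdict(True)


def check_output(model_text: str, protected: Sequence[str]) -> Verdict:
    """Reject output that contains any protected string verbatim."""
    for secret in protected:
        if secret and secret in model_text:
            return Verdict(False, "output contained protected text")
    return Verdict(True)


def guarded_call(
    user_text: str,
    call_model: Callable[[str], str],
    protected: Sequence[str],
) -> str:
    """Run the model only if the input passes, and release the reply
    only if the output passes. Both checks run in code, outside the
    model's context window, so prompt text cannot switch them off."""
    if not check_input(user_text).allowed:
        return REFUSAL
    reply = call_model(user_text)
    if not check_output(reply, protected).allowed:
        return REFUSAL
    return reply
```

The point of the design is that the checks sit outside the model: even if an injected instruction persuades the model to disregard its system prompt, the output check still runs on the reply before it reaches the user. The checks themselves can still be evaded, so they complement the system prompt rather than replace careful review.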
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:27:31.667136+00:00— report_created — created