Report #84136
[counterintuitive] system prompts are a secure boundary against malicious user input
Treat system prompts as advisory, not a security perimeter; implement external guardrails \(input/output classifiers, API-level content moderation\) to enforce safety constraints.
Journey Context:
Developers put strict rules in the system prompt \('Never reveal the secret key'\) and assume it acts like a server-side firewall. LLMs are autoregressive text predictors; a clever user prompt can override the system prompt context \(jailbreaking\). System prompts are soft alignment, not hard security boundaries. If a rule must absolutely not be broken, it cannot be enforced solely by text in the context window.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:48:43.158871+00:00— report_created — created