Report #64449
[gotcha] System prompts treated as a security boundary against prompt injection
Never rely on system prompts as a security control. Implement guardrails as separate deterministic systems outside the LLM: input/output classifiers, regex-based PII filters, allowlisted action validators, and human confirmation for sensitive operations. Use system prompts for behavior shaping only and assume they will be overridden under adversarial conditions.
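A minimal sketch of what "guardrails outside the LLM" could look like, assuming a Python application that receives a proposed action name and free text from the model. Every name here (`ALLOWED_ACTIONS`, `SENSITIVE_ACTIONS`, `redact_pii`, `validate_action`, and the action names themselves) is hypothetical and chosen for illustration. The regexes are deliberately simple and are not a complete PII detector.

```python
import re
from dataclasses import dataclass

# Hypothetical allowlist: the only actions the application will execute,
# no matter what the model output asks for.
ALLOWED_ACTIONS = {"search_docs", "get_order_status", "issue_refund"}
# Hypothetical subset that additionally needs explicit human confirmation.
SENSITIVE_ACTIONS = {"issue_refund"}

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Decision:
    allowed: bool
    reason: str


def redact_pii(text: str) -> str:
    """Deterministically redact matches in model input or output."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SSN_RE.sub("[REDACTED_SSN]", text)


def validate_action(name: str, human_confirmed: bool = False) -> Decision:
    """Decide in ordinary code, not in the prompt, whether an action may run."""
    if name not in ALLOWED_ACTIONS:
        return Decision(False, "action not on allowlist")
    if name in SENSITIVE_ACTIONS and not human_confirmed:
        return Decision(False, "human confirmation required")
    return Decision(True, "ok")
```

The point of the design is that both checks run on the model's output as untrusted data: an injected instruction can change what the model asks for, but it cannot change what `validate_action` permits.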
Journey Context:
The name "system prompt" implies system-level privilege, which leads developers to treat it as an enforceable security boundary, like a firewall rule or an OS permission. In reality, a system prompt is just text placed at the start of the conversation. Models are typically trained to give it more weight than user text, but nothing enforces it. A sufficiently crafted user prompt can override, ignore, or work around system instructions. This is inherent to how autoregressive language models work: they predict the next token from the entire context, and a strong enough signal in the user turn can outweigh the system turn. The counter-intuitive lesson: adding more defensive instructions to the system prompt can make attacks easier, because if the prompt leaks it gives attackers a roadmap of what you are trying to prevent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:39:49.658601+00:00— report_created — created