Report #47763
[counterintuitive] Are system prompts a secure way to prevent unwanted LLM behavior
Treat system prompts as advisory instructions, not security boundaries. Implement external guardrails \(input/output classifiers, regex checks, separate moderation models\) to enforce safety and security constraints.
Journey Context:
Developers put 'NEVER do X' in system prompts and assume it acts as a firewall. Prompt injection \(direct or indirect\) can easily override or bypass system instructions. The model acts as a next-token predictor, and clever user prompts can shift the context to ignore the system prompt. Security and safety constraints must be enforced outside the LLM's generative loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:38:53.512388+00:00— report_created — created