Report #45960
[counterintuitive] Are system prompts a secure way to protect LLM instructions?
Never treat the system prompt as a security boundary. Enforce safety with external guardrails (input/output classifiers, separate moderation models, API-level permission restrictions) so that a successful prompt injection cannot leak or do anything the surrounding system does not already permit.
Journey Context:
Developers put sensitive rules (e.g., 'never reveal the database schema') in the system prompt, assuming the model treats it as an immutable override. In reality, user prompts can manipulate the model into ignoring or revealing system instructions through prompt injection or social engineering of the LLM. The system prompt is just more text in the same context window: the model tends to give it more weight than user input, but nothing guarantees that it wins. It is not a sandboxed permission level. Anything the model can read, it can be talked into repeating, so secrets should stay out of the prompt and security must be enforced outside the model.
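A minimal sketch of what enforcing security outside the model could look like, assuming a generic `model` callable that takes the user text and returns a reply. All names here (`INJECTION_PATTERNS`, `SECRET_MARKERS`, `ALLOWED_TOOLS`, `guarded_call`) are hypothetical, and the regex and substring checks are only placeholders for a real classifier or moderation model; keyword filters alone are easy to bypass.

```python
import re
from typing import Callable

REFUSAL = "Request blocked by policy."

# Placeholder input check. In practice this would be a trained classifier
# or a separate moderation model, not a handful of regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"(reveal|print|show|repeat) (your |the )?system prompt", re.I),
]

# Placeholder output check: strings that should never reach the user.
# These example markers are assumptions chosen for illustration.
SECRET_MARKERS = ["CREATE TABLE", "password_hash"]

# Permission allowlist enforced by the application, not by the prompt.
ALLOWED_TOOLS = {"search_docs"}


def input_allowed(user_text: str) -> bool:
    """Return False if the user text matches a known injection pattern."""
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)


def output_allowed(model_text: str) -> bool:
    """Return False if the model reply contains a protected marker."""
    lowered = model_text.lower()
    return not any(marker.lower() in lowered for marker in SECRET_MARKERS)


def tool_allowed(tool_name: str) -> bool:
    """Return True only for tools the application explicitly permits."""
    return tool_name in ALLOWED_TOOLS


def guarded_call(user_text: str, model: Callable[[str], str]) -> str:
    """Run the model with checks before and after the call."""
    if not input_allowed(user_text):
        return REFUSAL
    reply = model(user_text)
    # The output check runs no matter what the system prompt said,
    # so a successful injection still cannot leak a protected string.
    if not output_allowed(reply):
        return REFUSAL
    return reply
```

The point of the design is that the output check and the tool allowlist hold even when the model has been fully talked out of its instructions. Better still, data such as the schema is never placed in the prompt in the first place.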
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:37:05.465532+00:00— report_created — created