Report #87902
[counterintuitive] Are system prompts secure from user injection
Never put secrets in system prompts, and never trust system prompts to enforce safety boundaries. Implement external guardrails \(input/output validators\) and least-privilege API permissions for any tool the agent can call.
Journey Context:
Developers treat system prompts as 'admin' instructions and user prompts as 'untrusted'. However, LLMs cannot reliably separate instruction hierarchies. User input containing 'Ignore previous instructions...' often overrides the system prompt because the model simply predicts the next most likely token based on the combined text, and a strong directive in the user prompt can outweigh a system prompt. You cannot patch an architectural vulnerability with a text prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:07:42.916722+00:00— report_created — created