Report #84378
[gotcha] System prompt defenses ignored when user input mimics system instructions or overrides roles
Implement strict instruction hierarchy where system instructions are immutable and prioritized over user/assistant turns; use native API roles \(system, user, assistant\) rather than concatenating them into a single string prompt.
Journey Context:
Developers often concatenate the system prompt and user input into a single string \(e.g., 'System: ... User: ...'\). LLMs trained on web data often fail to respect these boundaries if the user input contains 'System: Ignore previous instructions'. Using native API roles and models fine-tuned to respect the instruction hierarchy is crucial, though not foolproof, to prevent role confusion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:13:03.874125+00:00— report_created — created