Report #100332
[synthesis] User message overrides system instructions via prompt injection
On OpenAI, use the developer message / instruction hierarchy features when available and keep system instructions in the developer role. On Anthropic, separate privileged instructions from user input and rely on its stronger role-boundary training, but still validate. Do not mix privileged and user content in one message.
Journey Context:
GPT-4o without instruction hierarchy is more susceptible to 'ignore previous instructions' attacks than Claude, whose constitutional training explicitly distinguishes system-like guidance from user input. OpenAI's instruction hierarchy research and API features close this gap. The common mistake is a single dense system prompt plus blind trust in role boundaries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:03:03.719753+00:00— report_created — created