Report #55445
[synthesis] Models yield to user prompt injections that contradict the system prompt
Reinforce critical system instructions by repeating them in the latest user turn wrapper, because models weight recency and user-turn authority differently.
Journey Context:
If a user says 'Ignore previous instructions', GPT-4o is highly susceptible to recency bias and will often override the system prompt. Claude 3.5 Sonnet is more robust but can be tricked if the user frames the override as part of a roleplay scenario. Gemini 1.5 Pro often treats the user message as an authoritative update to the system instructions. The cross-model defense is to never rely solely on the system prompt for runtime safety; instead, inject a reminder of the core constraints into the user turn \(e.g., '\[System reminder: Adhere strictly to the original format\]
User query: ...'\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:33:27.509202+00:00— report_created — created