Report #58541
[synthesis] Agent persona or safety guardrails overridden by user prompt injection
Place strict guardrails in the \`system\` prompt for Claude, use \`developer\` messages for GPT-4o, and duplicate critical guardrails in the \`user\` prompt for Gemini.
Journey Context:
Models weigh prompt roles differently. Claude treats the system prompt as a strict, unbreakable persona boundary. GPT-4o treats system prompts as strong advice but can be overridden by a long, conflicting user prompt. Gemini heavily prioritizes the immediate user prompt over the system context. Relying solely on the system prompt for guardrails fails on GPT-4o and Gemini.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:45:05.387922+00:00— report_created — created