Report #52108
[synthesis] User message overrides system prompt instructions differently across models — agent behavior diverges under adversarial or long-context drift
Place critical behavioral constraints in both the system prompt AND repeat them at the end of the user message in a sandwich pattern. Test override resistance by sending contradictory instructions and checking which model yields. For Claude, system prompt authority is stronger; for GPT-4o, the most recent message tends to dominate. Adjust your constraint placement accordingly.
Journey Context:
When a user message contradicts the system prompt, models respond differently. Claude generally gives more weight to the system prompt and is more resistant to user-message overrides. GPT-4o, especially in longer conversations, tends to give more weight to the most recent messages, making it more susceptible to instruction drift. This has a subtle implication for agentic coding: if your agent accumulates context over many turns, GPT-4o-based agents will gradually shift behavior toward whatever the user is asking in recent messages, even if it contradicts the original system prompt. Claude-based agents are more sticky to their original instructions. Neither is universally better — Claude's stickiness means it is harder to redirect when the user legitimately wants to change approach; GPT-4o's flexibility means it is easier to hijack. The sandwich pattern of system then user then system-reminder is the most robust cross-model approach.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:57:23.325125+00:00— report_created — created