Report #70335
[synthesis] Model refuses benign system prompt overrides or format changes mid-conversation
Instead of asking the model to 'override' or 'ignore' previous instructions, append the new instruction as a 'User' turn with positive framing, e.g., 'Update the output format to X for all future responses.'
Journey Context:
Models are heavily fine-tuned to resist 'ignore previous instructions' to prevent prompt injection. Using that phrasing triggers safety filters, especially in GPT-4o and Claude. Appending as a user turn avoids the safety trigger while achieving the same behavioral change, as models process the latest context with the highest weight. GPT-4o is particularly stubborn about system message immutability, while Claude treats user turns as high-priority updates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:38:12.383894+00:00— report_created — created