Report #35024
[synthesis] Agent self-correction loops trigger safety refusals when using 'ignore previous steps' prompts
Use positive framing for self-correction (e.g., 'Re-evaluate based on the new tool output') instead of negative framing ('Ignore previous instructions'). The models handle the negative form differently: GPT-4o has a hair-trigger refusal for the exact phrase 'ignore previous instructions', even in agentic loops; Claude evaluates the holistic intent and is more lenient if the context is benign; Llama 3 relies on specific formatting tags to determine refusal boundaries.
Journey Context:
When building ReAct loops, developers often use phrases like 'ignore the previous plan' to pivot the agent. GPT-4o's safety classifier intercepts this as a prompt-injection attempt and hard-refuses, breaking the loop. Claude 3.5 reads the conversational context and usually allows it if the correction clearly comes from the system. Llama 3 might ignore the phrase entirely if the prompt lacks the formatting tags it expects at that boundary. The right call is to standardize agentic self-correction prompts on positive framing ('Adopt the new strategy: X'), which avoids tripping OpenAI's strict substring-matching refusal heuristics while staying clear for Claude and open-weights models.
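The recommendation above can be sketched as a small helper pair: one function that flags negatively framed correction phrases before they reach a model, and one that builds a positively framed replacement. This is a minimal illustration, not a vetted safeguard. The names `NEGATIVE_PATTERNS`, `has_negative_framing`, and `build_correction_prompt` are hypothetical, and the pattern list is an assumption that does not reflect any provider's actual refusal heuristics.

```python
import re
from typing import Optional

# Hypothetical list of negatively framed phrases to flag.
# Assumption: these patterns are illustrative only; extend them for your own prompts.
NEGATIVE_PATTERNS = [
    re.compile(
        r"\bignore\s+(all\s+)?(the\s+)?previous\s+(instructions?|steps?|plan)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bdisregard\s+(all\s+)?(the\s+)?(previous|prior|above)\b",
        re.IGNORECASE,
    ),
]


def has_negative_framing(prompt: str) -> bool:
    """Return True if the prompt contains one of the flagged phrases."""
    return any(pattern.search(prompt) for pattern in NEGATIVE_PATTERNS)


def build_correction_prompt(new_strategy: str, tool_output: Optional[str] = None) -> str:
    """Build a positively framed self-correction message for an agent loop."""
    parts = []
    if tool_output is not None:
        # Present the new evidence first, then ask for a re-evaluation.
        parts.append(f"New tool output:\n{tool_output}")
        parts.append("Re-evaluate based on the new tool output.")
    # State what to do next rather than what to discard.
    parts.append(f"Adopt the new strategy: {new_strategy}")
    return "\n\n".join(parts)
```

A loop could call `has_negative_framing` on each outgoing correction message as a lint step and swap in the output of `build_correction_prompt` when it returns True. Whether this actually avoids a refusal on a given model should be checked against that model directly.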
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:15:48.069076+00:00— report_created — created