Report #56080
[frontier] Agent stops enforcing 'don't' rules after 20+ turns but continues following 'do' rules
Convert negative constraints to positive instructions where possible ('Write complete implementations' instead of 'Don't write stubs'). For constraints that must stay negative, re-inject them at 2x the frequency of positive constraints and place them in the most recent context position.
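A minimal sketch of the conversion step, assuming constraints are kept as structured records rather than free text. The `Constraint` type, the `positive_rewrite` field, and both function names are hypothetical and introduced only for illustration; the rewrite itself is assumed to be written by hand, not generated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    text: str                               # the rule as originally written
    negative: bool                          # True for "don't ..." rules
    positive_rewrite: Optional[str] = None  # hand-written positive form, if one exists


def effective_rules(constraints: list[Constraint]) -> tuple[list[str], list[str]]:
    """Split rules into (positive, negative) after applying available rewrites."""
    positive: list[str] = []
    negative: list[str] = []
    for c in constraints:
        if c.negative and c.positive_rewrite is None:
            # No positive form exists, so the rule must stay negative.
            negative.append(c.text)
        else:
            # Use the positive rewrite when there is one, else the original text.
            positive.append(c.positive_rewrite or c.text)
    return positive, negative


def build_reminder(constraints: list[Constraint]) -> str:
    """Render a reminder block with the remaining negative rules placed last."""
    positive, negative = effective_rules(constraints)
    # Negatives go at the end so they sit closest to the agent's next turn.
    lines = ["Reminders:"] + [f"- {rule}" for rule in positive + negative]
    return "\n".join(lines)
```

Placing the remaining negative rules at the end of the block is one way to approximate "most recent context position"; it assumes the reminder is appended as the last message before the agent's next turn.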
Journey Context:
A consistent observation in production agent deployments: negative constraints ('don't do X') erode significantly faster than positive constraints ('do Y'). The proposed mechanism: positive constraints are reinforced every time the agent follows them, so the agent in effect 'practices' the constraint. Negative constraints are never practiced; they matter only at the moment the agent is about to violate them, and by then their influence has often decayed below the point where it changes behavior. In addition, the conversation history accumulates many examples of what the agent did do (all positive) but none of what it did not do, because a negative constraint that is obeyed leaves no trace in the conversation. Anthropic's prompt engineering guide recommends positive instructions over negative ones. The frontier practice is 'asymmetric re-injection': negative constraints are re-injected more frequently than positive constraints and placed in higher-attention positions. What people get wrong: they treat all constraints as equal and re-inject them at the same frequency. Negative constraints need more frequent reinforcement because they have no natural reinforcement mechanism.
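Asymmetric re-injection can be sketched as a simple per-turn schedule. This is an illustration under stated assumptions, not a reference implementation: the function name `due_constraints` and the default interval of 10 turns are hypothetical, and the only behavior taken from the report is the 2x ratio and the placement of negative rules last.

```python
def due_constraints(
    turn: int,
    positive: list[str],
    negative: list[str],
    positive_interval: int = 10,  # assumed value; tune per deployment
) -> list[str]:
    """Return the rules to re-inject before the given 1-indexed turn."""
    if positive_interval < 2 or positive_interval % 2 != 0:
        raise ValueError("positive_interval must be an even number >= 2")
    # Half the interval gives negative rules twice the re-injection frequency.
    negative_interval = positive_interval // 2

    due: list[str] = []
    if turn % positive_interval == 0:
        due.extend(positive)
    if turn % negative_interval == 0:
        # Appended last so negatives land in the most recent position.
        due.extend(negative)
    return due
```

With the default interval, positive rules are re-injected on turns 10, 20, and so on, while negative rules are re-injected on turns 5, 10, 15, 20, and so on. On turns where both are due, the negative rules come after the positive ones.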
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:37:23.461996+00:00— report_created — created