Report #62649
[frontier] Agent stops obeying 'never do X' negative constraints after 30+ turns
Convert every negative constraint into a positive instruction with a concrete example. Replace 'never use var in JavaScript' with 'always declare variables with const (for constants) or let (for reassignable variables): const count = 0; let name = "";'. Include 1-2 examples of the positive pattern directly in the system prompt.
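A minimal sketch of how the rewritten rule could be stored and rendered into a system prompt section. The PositiveRule shape, the renderRules helper, and the output layout are hypothetical names chosen for illustration; they are not part of any agent framework.

```typescript
// Hypothetical shape for one style rule; not taken from any real library.
interface PositiveRule {
  instruction: string; // what the agent should do
  examples: string[];  // 1-2 concrete instances of the pattern
}

// Positive rewrite of "never use var in JavaScript".
const declareWithConstOrLet: PositiveRule = {
  instruction:
    "Always declare variables with const (for constants) or let (for reassignable variables).",
  examples: ["const count = 0;", 'let name = "";'],
};

// Render rules as a plain-text section to paste into a system prompt.
function renderRules(rules: PositiveRule[]): string {
  return rules
    .map((rule) => {
      const examples = rule.examples
        .map((example) => `  Example: ${example}`)
        .join("\n");
      return `- ${rule.instruction}\n${examples}`;
    })
    .join("\n");
}

const promptSection = renderRules([declareWithConstOrLet]);
console.log(promptSection);
```

Keeping the examples next to the instruction means the pattern the agent should reproduce is in the context from the first turn, before the agent has generated any instances of its own.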
Journey Context:
Negative constraints face a reinforcement asymmetry in autoregressive models. When an agent successfully avoids a behavior, nothing is added to the context: the absence of a behavior is invisible. When it follows a positive instruction, each successful execution leaves an instance of the pattern in the context, which reinforces the pattern on subsequent turns. Negation also appears to be harder for transformers to track: the 'not' in 'do not use var' has to stay attached to the instruction throughout generation, and every additional turn is another opportunity for attention to drop the negation modifier. Converting to positive constraints gives the agent a concrete pattern to match, which tends to hold up better over long sessions. This matters most for coding agents, whose style constraints are often negative ('no any types', 'no console.log', 'no mutation'). Anthropic's own prompt engineering guidance explicitly recommends telling the model what to do rather than what not to do.
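As an illustration, the three negative style constraints named above could each be restated as a pattern the agent can copy. The positive wordings in the comments and every identifier below (parsePort, Logger, startServer, Config, withPort) are assumptions made for this sketch, not rules taken from any particular style guide.

```typescript
// "no any types" -> "type untrusted input as unknown and narrow it before use"
function parsePort(input: unknown): number {
  if (typeof input === "number" && Number.isInteger(input)) {
    return input;
  }
  throw new TypeError("port must be an integer");
}

// "no console.log" -> "send diagnostics through the injected logger"
// Logger is a hypothetical interface for this sketch.
interface Logger {
  info(message: string): void;
}

function startServer(port: number, logger: Logger): void {
  logger.info(`listening on ${port}`);
}

// "no mutation" -> "return a new object built with spread"
interface Config {
  readonly port: number;
  readonly host: string;
}

function withPort(config: Config, port: number): Config {
  return { ...config, port };
}
```

Each snippet is short enough to include verbatim in a system prompt as the example that accompanies its positive instruction.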
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:38:23.526220+00:00— report_created — created