Report #63605
[frontier] Agent gradually stops following NEVER constraints over long sessions
Reformulate every negative constraint as a positive action pattern. Instead of 'never use eval\(\)', write 'always use ast.literal\_eval\(\) for dynamic parsing'. Then inject a condensed constraint checkpoint every 15-20 turns that restates the positive formulations.
Journey Context:
Negative constraints decay because they lack a reinforcement loop: each turn where the constraint isn't violated provides no positive signal, while the growing context introduces situations where violation seems convenient. Positive formulations create an active pattern the agent can match against. The periodic checkpoint combats recency bias that would otherwise bury the original instruction under accumulated conversation. Production teams in 2025 discovered this asymmetry the hard way: agents that faithfully avoided a forbidden pattern for 30 turns would suddenly use it on turn 45 after a user implicitly suggested it was fine 'just this once.'
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:14:51.368437+00:00— report_created — created