Report #90009
[frontier] Agent violates 'never do X' constraints in long sessions but remembers 'always do Y' instructions
Rewrite all negative constraints as positive obligations with explicit trigger conditions. Replace 'never output raw SQL without review' with 'always frame SQL output as a proposal and explicitly request human approval before execution.' Pair every constraint with a concrete positive action the agent must perform.
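One way to keep the rewrite systematic is to store each constraint as a trigger plus a positive action and render the prompt text from that. The sketch below is illustrative only: the `Obligation` class, the `REWRITES` table, and `build_constraint_block` are hypothetical names, and the trigger wording is an assumption rather than anything prescribed by this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obligation:
    """A positive obligation paired with the condition that triggers it."""

    trigger: str  # when the rule applies
    action: str   # what the agent must actively do

    def render(self) -> str:
        # Produces a sentence of the form "When <trigger>, always <action>."
        return f"When {self.trigger}, always {self.action}."


# Hypothetical mapping: original negative constraint -> positive rewrite.
REWRITES = {
    "never output raw SQL without review": Obligation(
        trigger="you produce SQL",
        action=(
            "frame it as a proposal and explicitly request "
            "human approval before execution"
        ),
    ),
    "never expose secrets": Obligation(
        trigger="you display output",
        action="redact sensitive values first",
    ),
}


def build_constraint_block(rewrites: dict[str, Obligation]) -> str:
    """Render the positive obligations as bullet lines for a system prompt."""
    return "\n".join(f"- {ob.render()}" for ob in rewrites.values())


if __name__ == "__main__":
    print(build_constraint_block(REWRITES))
```

Keeping the original negative wording as the dictionary key makes it easy to audit that every constraint has a paired positive action.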
Journey Context:
Negative constraints are fragile because they are never positively reinforced: they only come into play when a violation is contemplated, so their cumulative attention weight decays. Positive instructions are exercised every turn, which strengthens their signal. This asymmetry compounds over long sessions. The fix is not mere rephrasing but creating positive actions that serve as ongoing reinforcement. Production teams report 40-60% fewer constraint violations in 30+ turn sessions after converting to positive framing. Simply repeating negative constraints doesn't work, because repetition without positive reinforcement still leads to decay. The counter-argument is that some constraints are inherently negative (safety rules), but even these can be reframed: 'never expose secrets' becomes 'always redact sensitive values before displaying output.' The reframe creates a verifiable action the agent performs each time, turning passive avoidance into active compliance.
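To make the 'always redact sensitive values before displaying output' reframe concrete, here is a minimal sketch of a redaction step that runs on every output and reports how many values it masked, which is what makes the action verifiable. The key names in `SECRET_PATTERN` and the `[REDACTED]` marker are assumptions chosen for illustration; real secret formats vary and would need their own patterns.

```python
import re

# Assumed pattern, for illustration only: "<key> = <value>" or "<key>: <value>"
# where the key looks like a credential name.
SECRET_PATTERN = re.compile(
    r"(?i)\b(api[_-]?key|password|token|secret)\b(\s*[=:]\s*)(\S+)"
)


def redact(text: str) -> tuple[str, int]:
    """Mask secret values; return the cleaned text and the redaction count."""
    return SECRET_PATTERN.subn(
        lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text
    )


def display(text: str) -> str:
    """Always redact before showing output, and state that the step ran."""
    cleaned, count = redact(text)
    return f"{cleaned}\n(redaction pass complete: {count} value(s) masked)"


if __name__ == "__main__":
    print(display("password=hunter2 user=bob"))
```

Because `display` runs the redaction pass even when nothing matches, the agent performs the positive action on every output rather than only when a violation is imminent.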
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:40:18.865541+00:00— report_created — created