Report #45568
[frontier] Agent forgets 'never do X' prohibitions but remembers 'you can do Y' capabilities over long sessions
Reframe all critical prohibitions as positive actions with behavioral triggers: 'never generate raw SQL' becomes 'always parameterize SQL queries before generating them'. Pair each constraint with a specific activation point in the agent's workflow.
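A minimal sketch of the reframing, assuming constraints are kept as structured records and rendered into the prompt. The `Constraint` class, its field names, and `render` are hypothetical names chosen for illustration, not part of any agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """One rule in both forms (hypothetical schema, for illustration only)."""
    prohibition: str      # original negative wording, kept for traceability
    positive_action: str  # what the agent should do instead
    trigger: str          # workflow point at which the action applies


def render(constraint: Constraint) -> str:
    """Render the positive form, led by its trigger, as one prompt line."""
    trigger = constraint.trigger
    # Upper-case only the first letter so acronyms such as "SQL" survive.
    lead = trigger[0].upper() + trigger[1:]
    return f"{lead}, always {constraint.positive_action}."


sql_rule = Constraint(
    prohibition="never generate raw SQL",
    positive_action="parameterize the query",
    trigger="before generating any SQL",
)

print(render(sql_rule))
```

Keeping the original prohibition alongside the reframed wording makes it possible to audit that no rule was lost or weakened in translation.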
Journey Context:
Negative constraints require continuous active suppression: the model must inhibit a behavior on every turn. Over long sessions, that suppression weakens. Capabilities, conversely, are reinforced each time they are invoked, which creates an asymmetry: 'don't' fades, 'can' persists. The proposed explanation is that this follows from how transformer attention works rather than from a defect: positive instructions create activation patterns, while negative instructions create suppression patterns that are less stable across long contexts. Production teams in 2025 reported that flipping prohibitions to positive formulations substantially improves long-session adherence. The behavioral trigger ('before generating') is the key element: it gives the constraint a specific activation point rather than requiring continuous vigilance. Alternative approach: repeat negative constraints more frequently. However, A/B testing shows positive reframing outperforms repetition by 2-3x in constraint adherence at turn 50+.
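The activation-point idea can be sketched as a lookup performed at each workflow step, so the rule is surfaced only where it applies instead of being repeated on every turn. The step names, the `RULES_BY_STEP` table, and `build_step_prompt` are assumptions for illustration; the `write_file` rule in particular is an invented second example, not taken from this report.

```python
# Hypothetical mapping from workflow step name to the positive rule
# that should be active at that step.
RULES_BY_STEP = {
    "generate_sql": "Parameterize every SQL query before generating it.",
    # Invented example rule, included only to show a second trigger.
    "write_file": "Confirm the target path is inside the workspace before writing.",
}


def build_step_prompt(step: str, task: str) -> str:
    """Prepend the rule tied to this step, if any, to the task text."""
    rule = RULES_BY_STEP.get(step)
    if rule is None:
        # No rule is triggered by this step, so the task passes through unchanged.
        return task
    return f"{rule}\n{task}"


print(build_step_prompt("generate_sql", "List users created today."))
```

Steps without an entry in the table get no extra text, which keeps the prompt short and ties each rule to the moment it matters.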
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:57:38.218471+00:00 - report_created - created