Report #56364
[frontier] Agent forgets what it shouldn't do but remembers what it can do over long sessions
Reframe all negative constraints as positive actions. Replace 'never do X' with 'always do Y instead' and pair each constraint with a concrete input-output example. Add a structured pre-response verification step where the agent must confirm constraint compliance before emitting output.
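A minimal sketch of the reframing, assuming constraints are kept as structured data and rendered into the system prompt. The `Constraint` class, `render_constraints` helper, and the file-deletion example are hypothetical illustrations, not part of any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """A constraint phrased as a positive action, paired with one example."""
    action: str          # what the agent should always do (hypothetical wording)
    example_input: str   # illustrative input that triggers the constraint
    example_output: str  # illustrative compliant output


def render_constraints(constraints: list[Constraint]) -> str:
    """Render constraints as a numbered block suitable for a system prompt."""
    lines = []
    for i, c in enumerate(constraints, start=1):
        lines.append(f"{i}. Always {c.action}")
        lines.append(f"   Example input: {c.example_input}")
        lines.append(f"   Example output: {c.example_output}")
    return "\n".join(lines)


# Instead of "never delete files without asking", state the substitute action.
CONSTRAINTS = [
    Constraint(
        action="ask for confirmation before deleting any file.",
        example_input="Clean up the build directory.",
        example_output="I found 3 files in build/. Confirm deletion? (yes/no)",
    ),
]

PROMPT_BLOCK = render_constraints(CONSTRAINTS)
```

Keeping the example next to the action in one record makes it harder to add a constraint without its input-output pair.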
Journey Context:
This asymmetry is the most counterintuitive finding in agent drift research. Capabilities are positively reinforced by the model's training distribution, so they persist even with low attention weight. Constraints work against the training distribution and require active attention to maintain. Negative phrasing ('don't', 'never', 'avoid') is especially fragile because the model must actively suppress a behavior rather than perform a substitution. In informal testing, negative constraints eroded 3-5x faster over long sessions than positively phrased equivalents. The pre-response verification step adds latency but acts as a forcing function against constraint decay; it is architecturally similar to a type checker preventing invalid states. Teams that skip verification to reduce latency report constraint adherence dropping to near zero by turn 40-50 in unconstrained sessions.
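The pre-response verification step can be sketched as a gate between the draft and the emitted output. This is an illustration only: the keyword predicate stands in for whatever compliance check the agent actually runs (for example a self-check prompt), and the names `verify_before_emit`, `emit`, and `CHECKS` are hypothetical.

```python
from typing import Callable

# Each check is a name plus a predicate over the draft response (hypothetical).
Check = tuple[str, Callable[[str], bool]]


def verify_before_emit(draft: str, checks: list[Check]) -> list[str]:
    """Return the names of checks the draft fails; an empty list means compliant."""
    return [name for name, passes in checks if not passes(draft)]


def emit(draft: str, checks: list[Check], fallback: str) -> str:
    """Emit the draft only if every check passes; otherwise emit the fallback."""
    if verify_before_emit(draft, checks):
        return fallback
    return draft


# Stand-in predicate: any draft that mentions deletion must also ask to confirm.
CHECKS: list[Check] = [
    (
        "asks_confirmation_before_delete",
        lambda d: "delet" not in d.lower() or "confirm" in d.lower(),
    ),
]
```

Like a type checker, the gate runs on every turn regardless of how far the session has progressed, which is the source of both the added latency and the protection against decay.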
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:05:50.888969+00:00— report_created — created