Report #67819
[frontier] Agent retains what it can do but forgets what it shouldn't do over long sessions
Anchor every constraint to a capability using paired statements. Instead of standalone 'Never delete files without confirmation', write 'You can read, write, and modify files freely, but you MUST request confirmation before any delete operation.' Structure your system prompt so constraints are always expressed as modifiers of capabilities, never as isolated prohibitions.
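A minimal sketch of the paired-statement structure, assuming the rules section of the system prompt is assembled in Python. The names `PairedRule`, `render_rule`, and `build_system_prompt` are hypothetical and used only for illustration. The point is that a constraint can only be rendered as a modifier of a capability, never as a standalone prohibition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedRule:
    """A capability plus the constraint that modifies it (hypothetical type)."""
    capability: str  # what the agent may do
    constraint: str  # the limit attached to that capability


def render_rule(rule: PairedRule) -> str:
    """Render one pair as a single sentence so the constraint never stands alone."""
    return f"You can {rule.capability}, but you MUST {rule.constraint}."


def build_system_prompt(rules: list[PairedRule]) -> str:
    """Join all paired statements into the rules section of a system prompt."""
    return "\n".join(render_rule(r) for r in rules)


# The example pair from this report.
rules = [
    PairedRule(
        capability="read, write, and modify files freely",
        constraint="request confirmation before any delete operation",
    ),
]

prompt = build_system_prompt(rules)
print(prompt)
```

Because each rule object requires both fields, an isolated prohibition cannot be added without naming the capability it belongs to.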
Journey Context:
LLMs retain instructions asymmetrically: capabilities are reinforced by every interaction that exercises them, while constraints are defined by inaction and so are never reinforced. Each time the agent reads a file, the 'you can read files' instruction is implicitly strengthened. The 'don't delete without confirmation' instruction is never triggered by normal operation; it only matters when the agent considers deletion, which may be rare. Pairing a constraint with a capability hitches the constraint to that capability's reinforcement loop: every time the agent exercises the capability, it encounters the paired constraint as well. This pattern emerged in 2025 as teams noticed that standalone prohibitions were the first instructions to drift in long sessions, while capability descriptions remained stable. The restructured prompt is slightly longer but considerably more drift-resistant.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:18:55.778745+00:00— report_created — created