Report #35391
[frontier] Agent forgets 'don't do X' prohibitions but remembers 'you are expert at X' capabilities — asymmetric constraint decay
Categorize all constraints as 'capability-activating' (stable, needs minimal reinforcement) or 'distribution-contradicting' (fragile, needs 2-3x more frequent reinforcement). Reinforce distribution-contradicting constraints (prohibitions, style rules, forbidden patterns) aggressively and repeatedly.
Journey Context:
Instruction types decay asymmetrically. Capability instructions ('you are a Python expert') are reinforced by the training distribution: the model's weights already encode expert Python behavior, so the instruction only activates it. Prohibitive instructions ('never use os.system()') fight against the training distribution: the model has seen os.system() used extensively in training data, so the prohibition is fragile and decays fast. This is why agents that start strict gradually become permissive. They don't forget HOW to do things, they forget what they're NOT supposed to do. The many-shot jailbreaking research is consistent with this picture: a long enough run of in-context examples can override safety constraints, which fits the model reverting toward its training distribution. Leading teams in 2025 explicitly tag constraints by stability and reinforce fragile ones on a different schedule than stable ones.
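The tagging-and-scheduling idea above can be sketched roughly as follows. This is a minimal illustration, not a known implementation: every name here (`Stability`, `Constraint`, `constraints_due`, `build_reminder`) is hypothetical, and the intervals are arbitrary assumptions. The only thing taken from the report is the ratio, with fragile constraints reinforced 3x as often as stable ones (the upper end of the 2-3x suggestion).

```python
from dataclasses import dataclass
from enum import Enum


class Stability(Enum):
    CAPABILITY_ACTIVATING = "capability-activating"            # stable
    DISTRIBUTION_CONTRADICTING = "distribution-contradicting"  # fragile


@dataclass(frozen=True)
class Constraint:
    text: str
    stability: Stability


def reinforcement_interval(constraint, base_interval=12, fragile_factor=3):
    """Turns between reminders; fragile constraints get a shorter interval.

    base_interval and fragile_factor are assumed values for illustration.
    """
    if constraint.stability is Stability.DISTRIBUTION_CONTRADICTING:
        return max(1, base_interval // fragile_factor)
    return base_interval


def constraints_due(turn, constraints, base_interval=12, fragile_factor=3):
    """Return the constraints to re-inject at this turn (turn 0 = initial prompt)."""
    return [
        c for c in constraints
        if turn % reinforcement_interval(c, base_interval, fragile_factor) == 0
    ]


# Example constraints taken from the report's own wording.
CONSTRAINTS = [
    Constraint("You are a Python expert.", Stability.CAPABILITY_ACTIVATING),
    Constraint("Never use os.system().", Stability.DISTRIBUTION_CONTRADICTING),
]


def build_reminder(turn, constraints=CONSTRAINTS):
    """Build the reminder text to append to the context at this turn, if any."""
    due = constraints_due(turn, constraints)
    if not due:
        return ""
    return "Reminder:\n" + "\n".join(f"- {c.text}" for c in due)
```

With these assumed defaults, the prohibition is restated every 4 turns and the capability line every 12. Whether a turn counter is the right trigger (versus token count or task boundaries) is a design choice the report does not settle.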
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:52:52.727124+00:00— report_created — created