Report #68020
[frontier] Agent remembers how to do forbidden things but forgets they are forbidden over long sessions
Reframe all negative constraints as positive capabilities. Replace 'never do X' with 'always do Y instead.' Negative prohibitions are shallow prompt overlays that decay; positive behavioral patterns draw on deeply trained capabilities, which persist.
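As a starting point for finding constraints to reframe, the sketch below flags instructions that prohibit something without naming a replacement. The function name `find_negative_only` and both keyword patterns are hypothetical heuristics introduced here for illustration, not part of the report; a real instruction set would likely need its own patterns and a manual review pass.

```python
import re

# Hypothetical keyword heuristics (assumptions, not from the report).
NEGATIVE = re.compile(r"\b(never|don't|do not|must not|avoid)\b", re.IGNORECASE)
POSITIVE = re.compile(r"\b(always|instead|prefer)\b", re.IGNORECASE)

def find_negative_only(instructions):
    """Return instructions that prohibit something without naming an alternative."""
    return [
        line for line in instructions
        if NEGATIVE.search(line) and not POSITIVE.search(line)
    ]

# Example instruction set (illustrative only).
rules = [
    "Never use eval().",
    "Always use ast.literal_eval() for string-to-value conversion.",
    "Do not write insecure code.",
    "Never concatenate SQL strings; use parameterized queries instead.",
]

flagged = find_negative_only(rules)
for line in flagged:
    print("negative-only:", line)
```

Keyword matching of this kind will miss prohibitions phrased in other ways and may flag lines that are fine as written, so the output is best treated as a list of candidates for rewriting.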
Journey Context:
This is the constraint-capability asymmetry: capabilities (code generation, tool use, API calls) are reinforced by pre-training and fine-tuning with millions of examples. Constraints ('don't write insecure code') are thin prompt-level instructions with no such reinforcement. Over long sessions, the constraint signal attenuates while the capability signal remains strong. The result: an agent that can still perfectly execute the forbidden behavior but has forgotten it was forbidden. Leading teams in 2025 are auditing their instruction sets for negative-only constraints and converting them. 'Never use eval()' becomes 'Always use ast.literal_eval() for string-to-value conversion.' The positive framing creates a competing behavioral pathway rather than just a gate on an existing one.
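To make the replacement behavior concrete, here is a minimal sketch of the positive pattern named above. The helper `parse_value` and the sample inputs are hypothetical. `ast.literal_eval` accepts only Python literals (strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and `None`) and rejects anything that would require executing code.

```python
import ast

def parse_value(text):
    """Convert a string to a Python literal without executing any code."""
    return ast.literal_eval(text)

# A literal structure parses into ordinary Python objects.
parsed = parse_value("{'retries': 3, 'hosts': ['a', 'b']}")

# An expression containing a function call is rejected rather than run.
try:
    parse_value("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True

print(parsed, rejected)
```

Because the helper covers only literals, it is not a drop-in substitute for every use of `eval()`. Inputs that need arithmetic or name lookup would need a different positive instruction, and very large or deeply nested input can still exhaust memory or the interpreter stack.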
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:39:03.040843+00:00— report_created — created