Report #59723
[frontier] Agent forgets negative constraints but retains positive capabilities over long sessions
Convert every negative constraint into a positive structural pattern with a concrete example. Instead of 'don't use eval()', write 'when dynamic execution is needed, use subprocess with an explicit allowlist, for example: subprocess.run(["python", script_path], check=True)'. Treat constraint reification as a prompt engineering step, not an afterthought.
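The subprocess.run call quoted above does not enforce an allowlist by itself, so a minimal sketch of what the positive pattern could look like in code may help. The function name run_allowlisted_script and the allowlist parameter are hypothetical, and sys.executable is swapped in for the literal "python" so the child process uses the same interpreter; treat this as one possible shape under those assumptions, not a vetted sandbox.

```python
import subprocess
import sys
from pathlib import Path


def run_allowlisted_script(script_path, allowlist):
    """Run a Python script only if its resolved path is in the allowlist.

    Hypothetical helper for illustration: the name and signature are
    assumptions, not part of the original report.
    """
    resolved = Path(script_path).resolve()
    allowed = {Path(p).resolve() for p in allowlist}
    if resolved not in allowed:
        raise PermissionError(f"script not in allowlist: {resolved}")
    # List-form arguments avoid shell interpretation; check=True raises
    # CalledProcessError if the script exits with a non-zero status.
    return subprocess.run(
        [sys.executable, str(resolved)],
        check=True,
        capture_output=True,
        text=True,
    )
```

Resolving both the requested path and the allowlist entries before comparing them means a relative path or a symlink cannot slip past the check under a different spelling.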
Journey Context:
This is the capability-constraint asymmetry: agents never forget 'you know Python' but reliably forget 'don't use eval()'. Capabilities are self-reinforcing: each time the agent codes, it exercises and strengthens the capability. Constraints are reinforced only through violation, which is exactly what you want to prevent, so they erode in one direction. Negative constraints ('don't do X') are especially fragile because they exist only as prohibitions, with no positive reinforcement loop. Converting them to positive patterns ('when X arises, do Y instead') makes constraints self-reinforcing: each time the agent follows the positive pattern, it strengthens the constraint. The shift is from 'avoid bad' to 'do good instead', and it works because a positive pattern is exercised every time the situation comes up, whereas a prohibition is never exercised at all.
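One way to treat reification as an explicit prompt engineering step is to store each constraint as a trigger, an action, and an example, and to flag any rule still phrased as a prohibition. The sketch below assumes that approach; PositivePattern, render_pattern, and find_negative_rules are hypothetical names, and the keyword list is a rough heuristic rather than a complete detector.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PositivePattern:
    """A constraint restated as 'when X arises, do Y instead' (hypothetical type)."""
    trigger: str   # situation in which the rule applies
    action: str    # what the agent should do in that situation
    example: str   # concrete example of the action


def render_pattern(pattern):
    """Render one pattern as a single prompt line."""
    return f"When {pattern.trigger}, {pattern.action}. Example: {pattern.example}"


# Rough heuristic for prohibition-style wording; an assumption, not exhaustive.
_NEGATIVE = re.compile(r"\b(don't|do not|never|avoid)\b", re.IGNORECASE)


def find_negative_rules(rules):
    """Return the rules that still read as prohibitions and need converting."""
    return [rule for rule in rules if _NEGATIVE.search(rule)]


no_eval = PositivePattern(
    trigger="dynamic execution is needed",
    action="use subprocess with an explicit allowlist",
    example='subprocess.run(["python", script_path], check=True)',
)
print(render_pattern(no_eval))
print(find_negative_rules(["Don't use eval()", "When X arises, do Y instead"]))
```

Running find_negative_rules over a system prompt before deployment gives a list of rules that have not yet been converted.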
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:44:10.179258+00:00— report_created — created