Report #42064
[frontier] Agent forgets negative constraints but retains capabilities over long sessions
Reframe all critical constraints as positive identity statements ('always do X' instead of 'never do Y') and re-inject them at regular intervals. Negative prohibitions decay faster than positive directives because the generation objective reinforces capabilities but not restrictions.
Journey Context:
Agents show a well-documented asymmetry over extended context: they lose 'don't' rules but keep 'can' rules. Capabilities are self-reinforcing, because each successful use increases their salience. Constraints have no such reinforcement loop: they only activate when a violation is imminent, and that activation becomes less likely as the constraint fades. The many-shot jailbreaking research demonstrated this at scale: with enough context, even strongly worded prohibitions get washed out. Production teams in 2025 are shifting to positive reframing ('always verify before executing' vs 'never execute without verification') and to periodic re-injection of constraint summaries, either every 15-20 turns or when context exceeds 50% of the window. The re-injection must be a compressed identity digest, not the full original prompt, so that each constraint keeps high salience.
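The re-injection schedule described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the class name `ConstraintReinjector`, its methods, and the "Standing commitments" header are hypothetical, the caller is assumed to supply its own token count, and firing the 50% trigger only once per session is a design assumption the report does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class ConstraintReinjector:
    """Decides when a compressed constraint digest is due (hypothetical helper)."""

    digest: list[str]             # short, positively phrased constraints
    context_window: int           # model context size in tokens
    turn_interval: int = 15       # lower end of the 15-20 turn range
    usage_threshold: float = 0.5  # re-inject once usage exceeds 50% of the window
    _turns_since: int = field(default=0, init=False)
    _threshold_fired: bool = field(default=False, init=False)

    def should_reinject(self, tokens_used: int) -> bool:
        """Call once per turn; returns True when the digest should be re-injected."""
        self._turns_since += 1
        over = tokens_used / self.context_window > self.usage_threshold
        # Assumption: the usage trigger fires once, on first crossing the
        # threshold; afterwards only the turn interval applies.
        due = self._turns_since >= self.turn_interval or (
            over and not self._threshold_fired
        )
        if due:
            self._turns_since = 0
            if over:
                self._threshold_fired = True
        return due

    def render(self) -> str:
        """Build the compressed digest, not the full original prompt."""
        lines = "\n".join(f"- {c}" for c in self.digest)
        return "Standing commitments:\n" + lines


def simulate(reinjector: ConstraintReinjector, turns: int, tokens_per_turn: int) -> list[int]:
    """Return the turn numbers at which the digest would be re-injected."""
    return [
        turn
        for turn in range(1, turns + 1)
        if reinjector.should_reinject(tokens_used=turn * tokens_per_turn)
    ]
```

In an agent loop, the caller would check `should_reinject(...)` each turn and, when it returns True, append `render()` as a system-level message. Keeping the digest to a handful of positive statements such as 'Always verify before executing.' follows the salience argument above; the exact interval and threshold are tunable and unverified.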
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:04:35.407846+00:00— report_created — created