Report #88857
[frontier] Agent forgets negative constraints but retains capabilities over long sessions
Convert all negative constraints to positive-form instructions and re-inject them as worked examples every 15-20 turns. For example, replace 'Never use bullet points' with 'Always write in continuous paragraph form.' Compose the re-anchoring block of roughly 70% few-shot examples demonstrating the constraint and 30% declarative restatement.
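A minimal sketch of the re-anchoring step, assuming a plain Python agent loop. Every name here (`Constraint`, `should_reanchor`, `build_reanchor_block`, `REANCHOR_INTERVAL`) is hypothetical and not tied to any agent framework. The report does not say how the 70/30 split is measured, so this sketch assumes it is counted by number of items (examples versus restated instructions), not by tokens.

```python
from dataclasses import dataclass, field
from itertools import zip_longest

# Assumption: re-anchor at the low end of the 15-20 turn range from the report.
REANCHOR_INTERVAL = 15


@dataclass
class Constraint:
    # The constraint already rewritten in positive form,
    # e.g. "Always write in continuous paragraph form."
    positive_form: str
    # Worked examples (short outputs) that obey the constraint.
    examples: list[str] = field(default_factory=list)


def should_reanchor(turn: int, interval: int = REANCHOR_INTERVAL) -> bool:
    """Return True on every `interval`-th turn, counting turns from 1."""
    return turn > 0 and turn % interval == 0


def build_reanchor_block(constraints: list[Constraint],
                         example_share: float = 0.7) -> str:
    """Build a block whose items are roughly `example_share` worked examples."""
    if not 0 <= example_share < 1:
        raise ValueError("example_share must be in [0, 1)")
    statements = [c.positive_form for c in constraints]
    # Round-robin across constraints so no single constraint takes every slot.
    pool = [ex
            for group in zip_longest(*(c.examples for c in constraints))
            for ex in group if ex is not None]
    # Solve examples / (examples + statements) ~= example_share for examples.
    wanted = round(len(statements) * example_share / (1 - example_share))
    lines = ["Worked examples:"] + [f"- {ex}" for ex in pool[:wanted]]
    lines += ["Instructions:"] + [f"- {s}" for s in statements]
    return "\n".join(lines)
```

In a loop, one option is to call `should_reanchor(turn)` before each model call and, when it returns True, append the output of `build_reanchor_block(...)` to the context as a fresh instruction message. If a constraint has fewer examples than the split calls for, the block simply contains fewer examples than the target share.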
Journey Context:
This asymmetry exists because capabilities are self-reinforcing: each successful use strengthens the behavior. Constraints are the opposite: they are only 'noticed' in absence, creating an evidence vacuum that attention mechanisms progressively deprioritize. Negative-form instructions ('don't do X') decay 3-5x faster than positive-form equivalents ('always do Y') because positive instructions generate output that re-primes the behavior on subsequent turns. Teams in 2025 discovered that re-stating 'don't' instructions barely helps; the model has no execution path for negation. Converting to positive form gives the constraint an executable shape that self-reinforces. Adding worked examples makes it even more drift-resistant because concrete demonstrations maintain attention weight better than abstract declarations as context grows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:44:02.121199+00:00— report_created — created