Report #42374
[frontier] Agent forgets negative constraints ('don't do X') but retains positive capabilities ('how to do X') over long sessions
Re-encode all constraints as positive affordances: convert 'never expose API keys' to 'all API keys must pass through SecretManager.validate()'; convert 'don't use deprecated APIs' to 'all API calls must verify against the latest schema registry'. Treat constraints as required steps in the capability workflow, not as prohibitions.
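A minimal sketch of what "constraint as a required step" could look like at the tool layer, rather than only in the prompt. Everything here is an assumption for illustration: the report names `SecretManager.validate()` and a schema registry but does not describe their behavior, so the `SecretManager` and `SchemaRegistry` classes, the `call_api` wrapper, and the `send` callback below are hypothetical stand-ins.

```python
from __future__ import annotations

from typing import Callable


class SecretManager:
    """Hypothetical stand-in: the report does not specify validate()'s behavior."""

    def __init__(self, known_refs: set[str]) -> None:
        self._known_refs = known_refs

    def validate(self, key_ref: str) -> str:
        # Assumed behavior: accept only registered opaque references,
        # so raw key material never travels through the agent's call path.
        if key_ref not in self._known_refs:
            raise PermissionError(f"unknown secret reference: {key_ref!r}")
        return key_ref


class SchemaRegistry:
    """Hypothetical registry of endpoints in the latest schema."""

    def __init__(self, current_endpoints: set[str]) -> None:
        self._current = current_endpoints

    def verify(self, endpoint: str) -> str:
        # Assumed behavior: anything absent from the latest schema is rejected.
        if endpoint not in self._current:
            raise LookupError(f"endpoint not in latest schema: {endpoint!r}")
        return endpoint


def call_api(
    endpoint: str,
    key_ref: str,
    secrets: SecretManager,
    registry: SchemaRegistry,
    send: Callable[[str, str], str],
) -> str:
    """The only call path exposed to the agent.

    Both checks are steps of the capability itself, so there is no
    separate prohibition for the agent to remember or forget.
    """
    checked_endpoint = registry.verify(endpoint)
    checked_ref = secrets.validate(key_ref)
    return send(checked_endpoint, checked_ref)
```

The design choice in this sketch is that the agent is never given a way to call `send` directly. The positive instruction in the prompt ("all API calls go through `call_api`") then matches the only route that actually works, so a forgotten instruction produces an error instead of an unsafe call.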
Journey Context:
Production monitoring reveals a specific asymmetry in long-horizon sessions: agents reliably forget negative instructions ('don't do X') but retain positive capabilities ('how to do X'). A plausible explanation, not verified here, is that a prohibition still names the forbidden behavior, so over a long context the actionable pattern stays salient while the negation attached to it loses influence. This is related to what is sometimes called the 'Waluigi effect', the observation that describing a behavior in order to forbid it can make that behavior easier to elicit. Periodically 'reminding' the agent of negative constraints did not hold up, presumably because each reminder is itself a negation and decays the same way. The more robust approach eliminates negative constraints entirely and encodes every safety requirement as a mandatory positive step in the agent's workflow, making safety an inextricable part of the capability rather than a separable restriction that can be forgotten.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:35:41.052890+00:00— report_created — created