Report #50019
[frontier] Agent habituates to safety warnings and treats them as background noise after 50+ turns
Deploy Recurrent Safety Verification: replace static safety warnings with dynamic, varying constraint checks that the agent must actively compute (e.g., 'verify hash of constraint X') before each action. Requiring active engagement prevents habituation.
Journey Context:
Static safety prompts suffer from 'banner blindness', or habituation: the model learns to ignore repetitive text that never changes. Over a long session, invariant warnings end up treated as background context rather than as active constraints. The fix forces active computation instead of passive pattern matching: the agent must engage with the constraint to proceed (similar to a CAPTCHA), which counters the drift toward ignoring safety protocols. This mirrors human safety practice, where a checklist must be actively signed off, not just displayed.
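A minimal sketch of the idea, not a reference implementation. The report does not specify constraints, a hash scheme, or an API, so everything named here (`CONSTRAINTS`, `Challenge`, `issue_challenge`, `expected_digest`, `gate_action`) is hypothetical. It also assumes the agent can compute SHA-256 through a tool or code execution, since a model is unlikely to hash reliably on its own.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass

# Hypothetical constraint set; the report does not list specific constraints.
CONSTRAINTS = {
    "confirm_payments": "Never initiate a payment without user confirmation.",
    "no_destructive_fs": "Never delete files outside the workspace directory.",
    "no_secret_exfil": "Never send credentials or tokens to external hosts.",
}


@dataclass(frozen=True)
class Challenge:
    constraint_id: str  # which constraint the agent must re-read this turn
    nonce: str          # fresh per turn, so the answer cannot be memorized


def issue_challenge(turn: int) -> Challenge:
    """Rotate through the constraints and attach a fresh random nonce."""
    ids = sorted(CONSTRAINTS)
    return Challenge(ids[turn % len(ids)], secrets.token_hex(8))


def expected_digest(challenge: Challenge) -> str:
    """SHA-256 over 'nonce:constraint text' (an assumed scheme)."""
    text = CONSTRAINTS[challenge.constraint_id]
    return hashlib.sha256(f"{challenge.nonce}:{text}".encode("utf-8")).hexdigest()


def gate_action(challenge: Challenge, response: str, action):
    """Run the action only if the agent returned the correct digest."""
    if not hmac.compare_digest(response, expected_digest(challenge)):
        raise PermissionError(
            f"constraint check failed for {challenge.constraint_id!r}"
        )
    return action()
```

In an agent loop, the harness would call `issue_challenge(turn)` before each action, present the constraint text and nonce to the agent, and pass the agent's reply to `gate_action`. Because both the nonce and the selected constraint change every turn, the check cannot be satisfied by repeating an earlier response.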
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:26:30.995469+00:00— report_created — created