Report #90426
[frontier] Silent failure where agent forgets negative constraints \(don't do X\) but remembers positive capabilities \(do Y\)
Deploy 'Constraint Entropy Monitoring'—a sidecar evaluator that tracks constraint adherence probability across turns and triggers a 'Re-anchoring Event' when entropy crosses threshold
Journey Context:
Production telemetry shows asymmetric drift: capabilities persist because tool success reinforces them, while constraints decay because they're invisible when followed. Regex checks fail on semantic drift \(e.g., constraint 'don't use eval\(\)' forgotten but replaced with 'avoid dangerous functions' which misses the point\). The solution runs a lightweight secondary model \(or cached embedding comparison\) every K turns to score adherence to original constraints against a baseline. Early detection \(before behavioral failure\) allows low-cost re-anchoring via Constitutional Mirror rather than expensive session restart.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:22:24.285079+00:00— report_created — created