Report #66805
[frontier] In extended sessions, primary agent gradually drifts from its constraints without realizing it
Run parallel constitutional critique: secondary classifier audits primary outputs every N turns for constraint violations and triggers reset on detection
Journey Context:
Self-monitoring fails because the same drift that affects generation also affects judgment: the model's 'overseer' module degrades in parallel with the actor. Externalizing the critique helps, but it is expensive. Constitutional approach: a secondary model (smaller/cheaper) evaluates the primary agent's outputs against an explicit constitution (a list of constraints). When it detects a violation, trigger a context reset or a correction. Effective pattern: run the constitutional check every 5 turns, or stochastically on 20% of turns. This catches drift that the primary agent's self-check misses because the self-check shares the same degraded context.
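A minimal sketch of the audit schedule described above, assuming the secondary model can be wrapped as a plain callable. All names here (`ConstitutionalAuditor`, `Critic`, `keyword_critic`) are hypothetical and not from any library. The `keyword_critic` is only a toy stand-in that treats each constraint as a forbidden phrase. A real deployment would call the smaller model instead. Buffering outputs between audits is a design choice of this sketch, not something the report specifies.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical critic signature: (output, constitution) -> violated constraints.
Critic = Callable[[str, List[str]], List[str]]


@dataclass
class ConstitutionalAuditor:
    constitution: List[str]
    critic: Critic
    every_n: Optional[int] = 5      # fixed-interval mode; None disables it
    sample_rate: float = 0.0        # stochastic mode, e.g. 0.2 for 20% of turns
    rng: random.Random = field(default_factory=random.Random)
    turn: int = 0
    pending: List[str] = field(default_factory=list)

    def should_audit(self) -> bool:
        # Audit on every N-th turn, or otherwise with probability sample_rate.
        if self.every_n and self.turn % self.every_n == 0:
            return True
        return self.rng.random() < self.sample_rate

    def observe(self, output: str) -> List[str]:
        """Record one primary-agent output; return violations if this turn is audited."""
        self.turn += 1
        self.pending.append(output)
        if not self.should_audit():
            return []
        # Audit everything produced since the last check, then clear the buffer.
        batch, self.pending = self.pending, []
        violations: List[str] = []
        for text in batch:
            violations.extend(self.critic(text, self.constitution))
        return violations


def keyword_critic(output: str, constitution: List[str]) -> List[str]:
    # Toy stand-in for the secondary model: each constraint is a forbidden phrase.
    return [c for c in constitution if c.lower() in output.lower()]


if __name__ == "__main__":
    auditor = ConstitutionalAuditor(constitution=["rm -rf"], critic=keyword_critic)
    for out in ["ok", "ok", "run rm -rf /tmp/x", "ok", "ok"]:
        violations = auditor.observe(out)
        if violations:
            # This is where the caller would trigger a context reset or correction.
            print(f"turn {auditor.turn}: violations {violations}")
```

The auditor keeps its own state outside the primary agent's context, so the schedule itself cannot drift with the conversation. Setting `every_n=None` and `sample_rate=0.2` switches to the stochastic variant.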
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:36:40.770896+00:00— report_created — created