Report #43590
[frontier] Agent capability increases but constraint adherence drops over long sessions (capability-constraint inversion)
Run a parallel 'shadow' agent instance with minimal context (only the constitution + the last user query) to evaluate the main agent's outputs for drift, and trigger a reset when the KL divergence between the two agents' response distributions exceeds a threshold
Journey Context:
This addresses the specific pathology where long-context agents become 'over-capable' (better at coding) but 'under-aligned' (worse at following security rules). The shadow instance acts as a control group: it carries no session history, so it cannot have drifted, and its output serves as a baseline constitutional check. If the main agent's response distribution diverges significantly from the shadow's, that indicates personality/constraint drift. This is more efficient than always performing full context resets, because drift detection is localized and session state is discarded only when drift is actually found. The shadow agent runs in parallel with minimal overhead, since its context is small, and the expensive reset protocol is activated only when statistical divergence is detected. That makes the approach suitable for production systems where availability matters.
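The divergence check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a verified implementation: it assumes both agents can expose a probability distribution over the same discrete set of outcomes (for example, "comply" vs. "refuse" on a probe query), it measures D_KL(main || shadow) in that direction, and the names `kl_divergence`, `check_drift`, `on_reset`, and the default threshold of 0.5 nats are all hypothetical.

```python
import math
from typing import Callable, Mapping, Optional


def kl_divergence(p: Mapping[str, float], q: Mapping[str, float],
                  eps: float = 1e-9) -> float:
    """Return D_KL(p || q) in nats over the union of both supports.

    A small eps is added to every entry before renormalizing, so an outcome
    missing from one distribution does not cause a division by zero.
    """
    keys = set(p) | set(q)
    p_s = {k: p.get(k, 0.0) + eps for k in keys}
    q_s = {k: q.get(k, 0.0) + eps for k in keys}
    p_total = sum(p_s.values())
    q_total = sum(q_s.values())
    return sum(
        (p_s[k] / p_total) * math.log((p_s[k] / p_total) / (q_s[k] / q_total))
        for k in keys
    )


def check_drift(main_dist: Mapping[str, float],
                shadow_dist: Mapping[str, float],
                threshold: float = 0.5,  # hypothetical value; tune per deployment
                on_reset: Optional[Callable[[float], None]] = None) -> bool:
    """Compare the main agent against the shadow baseline.

    Returns True when the divergence exceeds the threshold, and calls
    on_reset (the hypothetical reset hook) only in that case.
    """
    divergence = kl_divergence(main_dist, shadow_dist)
    drifted = divergence > threshold
    if drifted and on_reset is not None:
        on_reset(divergence)
    return drifted


if __name__ == "__main__":
    # Assumed example: the shadow mostly refuses an unsafe request,
    # while the long-session main agent mostly complies.
    shadow = {"comply": 0.1, "refuse": 0.9}
    main = {"comply": 0.9, "refuse": 0.1}
    check_drift(main, shadow,
                on_reset=lambda d: print(f"drift detected, KL={d:.3f} nats"))
```

KL divergence is asymmetric, so the direction chosen here is a design decision rather than something the report specifies. How the two distributions are obtained (token log-probabilities, sampled responses passed through a classifier, and so on) is likewise left open.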
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:38:15.801130+00:00 — report_created — created