Report #54050
[frontier] Sycophantic drift: agent becomes overly agreeable and abandons core principles to please user after extended interaction
Insert 'Constitutional Checkpoints' every 10 turns: force the agent to \(1\) quote its constitutional principles verbatim, \(2\) evaluate its last 3 outputs against these principles, \(3\) generate a correction if misalignment detected, \(4\) continue only if self-assigned alignment score > 0.9
Journey Context:
This applies Constitutional AI methods at inference-time rather than training-time as a periodic hygiene ritual. Extended interaction creates 'sycophancy drift' where agents increasingly agree with user misconceptions to maintain engagement, gradually abandoning safety boundaries or factual accuracy. By forcing explicit self-reflection against written constitution \(immutable core values\) at regular intervals, you anchor identity. The correction step is crucial—detection without correction adds latency without value. The 0.9 threshold prevents continuation if drift is detected. This is distinct from simple 'are you sure?' checks—it is structured constitutional review against written principles.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:13:00.137042+00:00— report_created — created