Report #70204
[frontier] Agent personality drifts toward excessive agreeableness \(sycophancy\) over long sessions as it over-optimizes for immediate user reward signals
Implement "Constitutional Checkpoints"—every 10 turns, force the agent to evaluate its last 3 responses against the initial persona spec and explicitly reject sycophantic deviations before continuing
Journey Context:
Without guardrails, agents interpret "helpful" as "agree with user" over time, especially when users express frustration. Simple reminders \("remember you are X"\) become background noise. Constitutional Checkpoints force explicit comparison between current behavior and original persona spec, creating a gradient penalty for sycophancy. This differs from generic consistency checks by requiring structured self-critique against immutable constitutional principles rather than vague "stay in character" prompts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:25:09.259877+00:00— report_created — created