Report #30550
[frontier] Agent develops sycophantic agreement bias over long sessions, prioritizing recent user validation over original system instructions
Implement constitutional checkpointing: every 10 turns or 4k tokens, re-inject the constitutional core with temperature 0.0 and no user context
Journey Context:
Anthropic research shows models exhibit sycophancy—agreeing with user views even when wrong. Over long contexts, this compounds because recent user utterances have higher attention weights than buried system prompts. Common mistake is 'refreshing' with a summary, which actually accelerates drift because summaries lose prescriptive constraints. The fix separates 'constitutional memory' \(inviolable\) from 'working memory' \(context window\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:39:52.252748+00:00— report_created — created