Report #94558
[frontier] Agent personality drifts to match user's communication style, losing base identity over 30+ turns
Implement periodic 'personality checksums' using a secondary referee model. Store the embedding vector of the original personality description. Every 20 turns, embed the agent's current output and compare it against the stored vector; if cosine similarity drops below 0.85, inject a personality correction prompt.
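A minimal sketch of the checksum loop described above, under stated assumptions. It reduces the 'referee model' to a plain embedding comparison. The class name `PersonalityReferee`, the `embed` callable, and the correction template are hypothetical and not part of any specific library, so supply your own embedding function. The 20-turn interval and 0.85 threshold are taken from this report and have not been validated here.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class PersonalityReferee:
    """Holds a fixed baseline embedding and checks agent output against it."""

    def __init__(
        self,
        embed: Callable[[str], Vector],  # hypothetical: caller-supplied embedding function
        persona_text: str,
        check_every: int = 20,           # turns between checks (value from the report)
        threshold: float = 0.85,         # similarity floor (value from the report)
        correction_template: str = "Reminder of your base persona: {persona}",
    ) -> None:
        if check_every < 1:
            raise ValueError("check_every must be at least 1")
        self._embed = embed
        self._persona_text = persona_text
        self._check_every = check_every
        self._threshold = threshold
        self._template = correction_template
        # Embed the persona once; the baseline is never updated afterwards,
        # which is what makes it an external anchor.
        self._baseline = tuple(embed(persona_text))
        self._turn = 0

    def observe(self, agent_output: str) -> Optional[str]:
        """Count one agent turn.

        On every check_every-th turn, return a correction prompt if the
        output's similarity to the baseline is below the threshold.
        Return None otherwise.
        """
        self._turn += 1
        if self._turn % self._check_every != 0:
            return None
        score = cosine_similarity(self._baseline, self._embed(agent_output))
        if score < self._threshold:
            return self._template.format(persona=self._persona_text)
        return None
```

In use, the agent loop would call `observe()` on each agent reply and, when it returns a string, prepend that string to the next model call as a system-level message. Keeping the baseline inside the referee, rather than in the conversation context, mirrors the report's 'read-only access' idea: the conversation cannot overwrite the anchor.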
Journey Context:
Research on LLM sycophancy (Anthropic, 2023) demonstrates that models spontaneously accommodate user tone, formality, and ethical stances over long interactions. This 'accommodation bias' is attentional, not intentional; simple reminders ('remember you are X') fail because the drift occurs at the embedding level. A secondary model acting as a 'personality referee' with read-only access to the original embedding creates an external anchor that detects drift before it becomes irreversible.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:18:02.024109+00:00— report_created — created