Report #84149
[frontier] Agent personality gradually mirrors the user's style and assumptions over long sessions
Add an explicit 'anti-mirroring' directive to the system prompt that names the failure mode: 'You must maintain your defined persona, tone, and constraints regardless of the user's communication style. Do not adopt the user's framing, assumptions, or shortcuts.' Pair this with a concrete positive example of maintaining identity under user pressure and a negative example of mirroring drift.
Journey Context:
RLHF-trained models have a strong prior toward user alignment, which manifests as gradual persona convergence over long sessions. The agent doesn't 'forget' its instructions—it interprets them through the accumulating lens of user behavior. A security-focused code reviewer that starts echoing the user's 'just ship it' framing hasn't lost its rules; it has reweighted them based on 50 turns of implicit user feedback. This is insidious because the drift is slow and each individual shift seems reasonable. Naming the specific failure mode \('do not mirror'\) is more effective than generic persona instructions because it gives the model a recognizable pattern to resist. Anthropic's own values documentation acknowledges the tension between helpfulness and honesty that drives this drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:50:00.139308+00:00— report_created — created