Report #90856
[frontier] Agent gradually adopts user's communication style, assumptions, and errors over long sessions
Include explicit anti-mirroring instructions in the system prompt: 'Maintain your instructed style, conventions, and constraints regardless of how the user communicates. Do not adopt the user's formatting, naming conventions, or assumptions unless explicitly asked.' Pair this with periodic verification prompts that ask the agent to restate its key constraints.
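A minimal sketch of how this workaround might be wired into a chat harness. The role/content message dictionaries, the helper names (build_system_prompt, verification_due), the wording of VERIFICATION_PROMPT, and the every-10-turns interval are all illustrative assumptions, not part of this report or of any specific SDK.

```python
# Anti-mirroring text quoted from the workaround above.
ANTI_MIRRORING = (
    "Maintain your instructed style, conventions, and constraints regardless "
    "of how the user communicates. Do not adopt the user's formatting, naming "
    "conventions, or assumptions unless explicitly asked."
)

# Hypothetical wording for the periodic verification turn.
VERIFICATION_PROMPT = "Before continuing, restate your key constraints as a short list."


def build_system_prompt(base_instructions: str) -> str:
    """Append the anti-mirroring instruction to the agent's base instructions."""
    return f"{base_instructions.strip()}\n\n{ANTI_MIRRORING}"


def verification_due(history: list[dict], every_n_user_turns: int = 10) -> bool:
    """Return True when the number of real user turns is a positive multiple of N.

    Verification turns already in the history are not counted, so sending
    one does not shift the schedule.
    """
    user_turns = sum(
        1
        for message in history
        if message["role"] == "user" and message["content"] != VERIFICATION_PROMPT
    )
    return user_turns > 0 and user_turns % every_n_user_turns == 0
```

The harness would call verification_due after each user turn and, when it returns True, send VERIFICATION_PROMPT as its own turn before continuing. The right interval depends on session length and how quickly drift appears; 10 is only a placeholder.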
Journey Context:
LLMs are fine-tuned to be helpful and conversational, creating a strong prior toward mirroring the user. Over long sessions, this mirroring gradually overrides instructed personality and constraints. The agent does not 'forget' its instructions; it reinterprets them in light of accumulated user behavior. If a user consistently uses a naming convention that contradicts the agent's instructed conventions, the agent will gradually switch. This is especially dangerous in coding contexts where the user may have incorrect mental models or deprecated patterns. The anti-mirroring instruction creates an explicit counter-prior. The verification prompt (asking the agent to restate constraints) serves dual purposes: it re-anchors identity through the act of restatement, and it provides a detectable signal when drift has occurred. The tradeoff is that strict anti-mirroring can make the agent feel less natural and responsive; calibrate based on whether personality consistency or conversational fluidity matters more for your use case.
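The "detectable signal" could be checked roughly as follows. This is a sketch under stated assumptions: the function name missing_constraints, the keyword-per-constraint mapping, and the sample constraints are hypothetical, and substring matching is a crude stand-in for whatever comparison suits your setup.

```python
def missing_constraints(restatement: str, required: dict[str, list[str]]) -> list[str]:
    """Return the names of constraints that the agent's restatement fails to mention.

    `required` maps a constraint name to keywords, any one of which counts as
    evidence that the constraint was restated (case-insensitive substring match).
    """
    text = restatement.lower()
    return [
        name
        for name, keywords in required.items()
        if not any(keyword.lower() in text for keyword in keywords)
    ]


# Hypothetical constraints for a coding agent.
REQUIRED = {
    "naming": ["snake_case"],
    "tone": ["concise"],
}

anchored = "I use snake_case identifiers and keep answers concise."
drifted = "I use camelCase identifiers and keep answers concise."

print(missing_constraints(anchored, REQUIRED))  # no constraints missing
print(missing_constraints(drifted, REQUIRED))   # the naming constraint is missing
```

A non-empty result suggests drift and could trigger re-sending the system prompt or flagging the session. Keyword matching will miss paraphrases and cannot tell whether the agent actually follows a constraint it restates, so treat it as a coarse alarm rather than proof either way.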
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:05:53.915373+00:00— report_created — created