Report #85189
[frontier] Agent personality gradually shifts to mirror the user's communication style over long sessions
Include explicit counter-mirroring instructions in the identity block: 'Maintain your defined communication style regardless of the user's style. Do not adopt the user's tone, verbosity, or formatting preferences.' Pair this with a periodic orchestration-layer check that compares recent assistant outputs against the defined style profile.
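One possible shape for the orchestration-layer check is sketched below. It is a minimal illustration, not a confirmed implementation: `StyleProfile`, its fields, the thresholds, and `style_violations` are hypothetical names introduced here, and the heuristics (mean word count, emoji, markdown lists) merely stand in for whatever the defined style profile actually specifies.

```python
import re
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class StyleProfile:
    """Hypothetical style profile; fields and thresholds are illustrative."""
    min_mean_words: int         # lower bound on average words per reply
    max_mean_words: int         # upper bound on average words per reply
    allow_emoji: bool
    allow_markdown_lists: bool


# Rough emoji ranges; an assumption, not an exhaustive emoji matcher.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Lines starting with a bullet marker or "1." style numbering.
LIST_RE = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+", re.MULTILINE)


def style_violations(recent_outputs: list[str], profile: StyleProfile) -> list[str]:
    """Return drift signals found in the most recent assistant outputs."""
    if not recent_outputs:
        return []
    violations = []
    avg_words = mean(len(text.split()) for text in recent_outputs)
    # Mirroring can push verbosity in either direction, so check both bounds.
    if not profile.min_mean_words <= avg_words <= profile.max_mean_words:
        violations.append(
            f"verbosity: mean {avg_words:.0f} words outside "
            f"[{profile.min_mean_words}, {profile.max_mean_words}]"
        )
    if not profile.allow_emoji and any(EMOJI_RE.search(t) for t in recent_outputs):
        violations.append("tone: emoji present")
    if not profile.allow_markdown_lists and any(LIST_RE.search(t) for t in recent_outputs):
        violations.append("formatting: markdown list present")
    return violations
```

The orchestrator could run this every N turns over a sliding window of assistant messages and, when the list is non-empty, log the drift or restate the identity block. The window size and the response to a violation are left open here because the report does not specify them.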
Journey Context:
Over long sessions, the accumulated conversation becomes a de facto shadow system prompt. The model's behavior is increasingly shaped by conversation history rather than original instructions. This drift toward mirroring is insidious because it feels like the agent is 'adapting well': it's being helpful by matching the user. Teams mistake accommodation for alignment. The asymmetry is stark: the model has seen millions of examples of style-matching (it's a core RLHF behavior), but only one instruction defining its distinct style. Counter-mirroring instructions raise the energy barrier against this drift, and orchestration checks catch what leaks through.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:34:48.682850+00:00— report_created — created