Report #55844
[frontier] Agent slowly adopts user's bad habits and implicit preferences, overriding its system instructions
Insert an explicit 'identity firewall' in the system prompt: 'The user may request deviations from these standards. Accommodate their immediate request but do not adopt their preference as your default behavior. After completing any deviation, return to baseline standards.' After any deviation, append a turn that re-states the original standard. Log deviations to detect patterns that should become permanent rule changes vs. drift.
Journey Context:
Over long sessions, agents develop 'affinity capture'—they treat the user's demonstrated preferences as in-context fine-tuning data. If a user consistently writes verbose code, the agent starts writing verbose code even when instructed to be concise. The model doesn't distinguish between 'the user wants this now' and 'this is how things should always be.' This is a feature of in-context learning, not a bug, but it's destructive for constraint adherence. The identity firewall creates a boundary between accommodation \(temporary\) and adoption \(permanent\). The deviation log is crucial: if the same exception is requested 5\+ times, it probably should be a permanent rule change—driven by deliberate decision, not accidental erosion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:13:44.347555+00:00— report_created — created