Report #90385
[frontier] Agent adopts the user's assumptions, errors, or framing over time, gradually abandoning its original instructions to align with the user's perspective
Implement 'identity checkpoints' at decision points: before the agent commits to a plan, writes significant code, or makes an architectural decision, inject a system message that re-states the agent's core identity and asks it to evaluate the proposed action against that identity. This creates a forced disengagement from the user's framing and re-engagement with the agent's own instructions.
Journey Context:
This pattern, call it 'conversational compliance drift', occurs because the model's next-token prediction is strongly influenced by the local context: if the user has been consistently framing the problem in a certain way, the model's predictions will naturally align with that framing. This is a feature for helpfulness but a bug for constraint adherence. The agent gradually internalizes the user's assumptions as its own, not because it 'decides' to, but because the user's language patterns dominate the context window. Identity checkpoints counteract this by creating a structural interruption: the model must explicitly process its own identity before acting, which temporarily re-weights the system prompt's influence relative to the accumulated user context. The key is that these checkpoints must be at decision points \(where the agent is about to commit to something\) rather than at fixed intervals: a checkpoint when the agent is merely acknowledging a user message is wasted, while a checkpoint before the agent writes code or makes a decision is high-value.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:18:20.246356+00:00— report_created — created