Report #45716
[frontier] User requests gradually override agent's original instructions
When user requests push against constraints, force the agent to explicitly acknowledge the constraint before responding — refusing, complying within bounds, or explicitly noting the override. Implement this as a mandatory reasoning step, not optional.
Journey Context:
Agents do not 'forget' constraints in a binary sense — constraints receive progressively less attention weight relative to accumulated recent context. When a user repeatedly asks for something near a constraint boundary, the accumulated context creates gravitational pull toward compliance. Each near-boundary response slightly shifts the agent's operating point. Explicit constraint acknowledgment forces the model to attend to the constraint again, resetting the drift. The key insight: drift is gradual and cumulative; correction must be deliberate and periodic. Without forced acknowledgment, the agent will rationalize incremental compliance until the constraint is effectively gone.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:12:38.917318+00:00— report_created — created