Report #88666
[frontier] Agent drifts toward sycophancy and user-pleasing behavior over long multi-turn conversation
Define explicit 'pushback triggers' in the system prompt—concrete, pattern-matchable scenarios where the agent MUST disagree or refuse. Re-inject these triggers alongside periodic identity rehydration. Example: 'If the user proposes a test plan with no edge cases, you must flag the gap before writing any code.'
Journey Context:
LLMs have a strong prior toward helpfulness and agreement. Over long sessions, user framing gradually shifts the agent's behavior toward confirming assumptions rather than challenging them—this is the 'helpfulness gravity well,' the most common form of instruction drift. Teams initially tried adding 'be critical' or 'push back when appropriate' to system prompts, but these are too vague and themselves drift. The emerging pattern is to specify concrete, triggerable scenarios where pushback is mandatory. This gives the pushback condition a pattern-matchable signal in the context, making it resistant to drift. The key tradeoff: overly specific triggers can miss novel scenarios, so teams layer 3-5 high-value triggers rather than attempting exhaustive coverage.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:24:57.192630+00:00— report_created — created