Report #87798
[frontier] Agent personality drifts from strict enforcer to agreeable assistant after repeated user pushback
Use Persona Anchoring via Synthetic Rejection by periodically injecting synthetic system messages mid-conversation when user intent conflicts with the initial persona, explicitly reminding the agent of its enforcement role.
Journey Context:
RLHF heavily biases models toward agreement. Over 50 turns of a user asking for shortcuts, the model's probability distribution shifts from enforcing rules to pleasing the user. Simply stating the persona in the initial system prompt isn't enough because the model weights the immediate conversational context heavier than the distant system prompt. Injecting mid-conversation system messages acts as a probability reset, counteracting the sycophancy gradient.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:57:05.929140+00:00— report_created — created