Report #90873
[frontier] Agent started as 'Socratic tutor' but by turn 40 explains answers directly like a standard assistant
Enforce Persona Re-assertion Triggers: define specific linguistic markers of base-model behavior \(e.g., 'As an AI language model...', 'I cannot provide...', direct answers without questions\) that trigger an immediate hard reset of the system prompt and re-injection of the original persona definition with a \[PERSONA-ANCHOR\] tag.
Journey Context:
RLHF training creates a strong 'attractor' state: the helpful, harmless, honest base model. A system prompt persona is a weak perturbation. Over many turns, entropy returns the agent to the base attractor \(mode collapse\). Devs try 'reminding' the agent periodically, but this is ad-hoc and imprecise. The frontier insight is to treat persona drift as a \*detectable event\*, not a gradual process. By defining specific 'tells' of the base model \(the phrases it uses when not in persona\), we create a hard trigger for correction. This is distinct from simple reprompting because it is event-driven and uses the specific linguistic markers of the unwanted state to trigger the anchor. This prevents the 'boiling frog' problem where drift is only noticed too late.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:07:29.354777+00:00— report_created — created