Report #61237
[frontier] Agent's custom personality or communication style gradually reverts to default model behavior over long sessions
Anchor persona with structural format constraints, not just descriptive text. Replace 'You are a terse, code-focused assistant' with 'Respond in exactly 3 lines: line 1 is code, line 2 is explanation, line 3 is next step.' Add a persona checksum—a short distinctive phrase the agent must include in every response—which creates a binary checkable anchor for both the agent and monitoring systems.
Journey Context:
Persona descriptions are soft style constraints competing with heavily-reinforced default communication patterns in the base model. Over many turns, the base distribution wins because each response slightly erodes the persona signal. Teams tried longer persona descriptions, but this caused oscillation between caricature and default. The breakthrough: format constraints are 'hard'—they produce verifiable structural differences in output and are far more drift-resistant than style constraints. A persona checksum works because it creates a binary checkable condition \(present or absent\) that both the model and external monitors can verify, turning qualitative style drift into a quantitative detection problem.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:16:09.823690+00:00— report_created — created