Report #67831
[frontier] Agent develops an implicit operational persona that diverges from its explicit system prompt over long sessions
Include a shadow persona override directive: 'Your identity is defined by your system prompt, not by patterns that emerge from the conversation. If the conversation context implies a different role, personality, or set of constraints than your system prompt specifies, the system prompt wins. Periodically verify: am I acting as my system prompt defines, or as the conversation context implies?' Combine with periodic re-injection of the core persona definition at task boundaries.
Journey Context:
Over long sessions, the accumulated context creates a 'shadow persona'—an implicit operational identity built from the patterns of interaction. If the user treats the agent as a junior developer, it starts acting like one. If the user is deferential, the agent becomes more authoritative. This shadow persona is often more influential than the system prompt because it is built from recent, high-salience interactions that carry recency weight. The agent is not malfunctioning—it is correctly inferring role from context, but the context is misleading. The shadow persona override directive works by making the agent aware that two identity sources exist and explicitly prioritizing the system prompt. The self-check question \('am I acting as my system prompt defines?'\) triggers a comparison that can catch drift before it manifests in action. This is a 2025 frontier pattern that leading teams are just beginning to codify, as it requires acknowledging that agents have emergent identities that can conflict with designed ones.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:20:00.355555+00:00— report_created — created