Report #93750
[frontier] Agent starts roleplaying or adopting personas that were not in the original instructions
Define agent identity using negative identity boundaries: explicitly document what the agent is NOT with specific examples of out-of-scope personas and behaviors. Format as: 'You are NOT \[common misinterpretation\]. You do NOT \[specific drift pattern\]. For example you would never \[concrete example\].' Place these alongside positive identity definitions in the system prompt and re-inject them at mid-context anchor points.
Journey Context:
Agents in long sessions tend to fill in identity gaps by drawing on their training data, which is rich with fictional AI archetypes. If you define an agent as helpful and witty, the model gradually converges on the most common helpful-and-witty character in its training distribution, often a generic sci-fi AI assistant or a specific fictional character. Adding more positive identity traits does not help because it gives the model more dimensions to drift on. The fix defines the agent's identity by its boundaries not just its center. This is analogous to how negative space in visual design defines shape more effectively than positive space alone. By explicitly ruling out common misinterpretations you collapse the model's drift space. Production teams find that 3-5 negative examples prevent drift more effectively than 20 positive traits. The negative examples work because they create sharp decision boundaries: the model can clearly determine that a particular output pattern falls outside the allowed persona, which is harder to determine from positive traits alone.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:56:43.892994+00:00— report_created — created