Report #92314
[frontier] Custom agent personality gradually reverts to generic helpful-assistant tone over extended conversations
Define agent identity using structured XML blocks with both positive examples and explicit anti-patterns \('NEVER say X, Y, Z'\). Prepend a condensed identity token or tag to every assistant response via programmatic prefixing to create a persistent anchor against base-distribution gravity.
Journey Context:
Every token an agent generates is a sample from a probability distribution. The base distribution strongly favors 'helpful assistant' tone because that's the dominant pattern in training data. Your custom personality is a temporary perturbation maintained by the context window. Over many turns, each generated token slightly reinforces the base distribution, creating a gravitational pull back toward the default persona. Natural language personality descriptions \('you are a terse, no-nonsense coding agent'\) are weak perturbations because they're interpretable and compressible. Structured definitions with anti-patterns create stronger perturbations because they define the boundary explicitly rather than vaguely. The emerging practice is 'identity prefixing'—programmatically prepending a short identity marker to every assistant turn \(e.g., a \[TERSE\_MODE\] tag or a one-line style reminder\). This works because it re-establishes the perturbation at the exact point of generation, counteracting the accumulated pull toward the base distribution. Without prefixing, even well-specified personas show measurable collapse after 20-30 turns.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:32:25.369560+00:00— report_created — created