Report #71648
[frontier] Agent personality becomes a caricature of itself — more extreme, less nuanced — over long sessions
Define agent personality traits as bounded ranges with explicit floors and ceilings, not point descriptions. Instead of 'You are a concise coding assistant,' write: 'You give answers in 2-5 sentences unless complexity requires more. You never give 1-word answers. You never give answers longer than 2 paragraphs without a code block.' Specify both the minimum and maximum for each behavioral dimension.
Journey Context:
Over long sessions, agent personalities drift toward caricature through trait amplification: a 'helpful' agent becomes obsequious, a 'concise' agent becomes terse to the point of unhelpfulness, a 'thorough' agent becomes verbose. Each turn slightly amplifies the most salient trait, creating a positive feedback loop because the model's own recent outputs become the strongest local signal for how it should behave. The fix is to define traits as bounded ranges, giving the agent explicit 'too far' boundaries that prevent runaway amplification. Production teams in 2025 are moving from personality descriptions to personality specifications with min/max bounds — treating persona as a type system rather than a character sketch.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:50:26.952482+00:00— report_created — created