Report #94361
[frontier] Abstract persona descriptions fail to hold over long sessions: the agent drifts to its default personality
Replace abstract persona descriptions ('you are a terse, no-nonsense expert') with concrete behavioral specifications ('respond to code questions in under 50 words', 'never say "great question" or "good point"', 'when you find a bug, state it directly without hedging'). Concrete specs create hard decision boundaries that resist gradual reinterpretation.
Journey Context:
Abstract persona descriptions are the most drift-prone instruction type because they require the model to hold a consistent interpretation of subjective terms ('terse', 'no-nonsense', 'expert') across thousands of reasoning steps. Over a long session, the RLHF prior tends to reinterpret 'terse' as 'moderately concise', then 'standard length', then 'helpfully detailed'. Each step is small, but the cumulative effect is complete persona erosion. Concrete behavioral specs arrest this by creating objective, verifiable criteria: either the response is under 50 words or it isn't, and a word count leaves no room for gradual reinterpretation. The frontier practice in 2025-2026 is to write persona definitions as testable behavioral contracts rather than character descriptions. Tradeoff: concrete specs are less nuanced and can feel robotic if over-specified. Best practice: use abstract descriptions for motivational framing ('you prioritize security because you've seen too many breaches') and concrete specs for observable behaviors ('never suggest disabling authentication', 'flag every unvalidated input').
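One way to read "testable behavioral contract" is to store each concrete rule next to a check that can be run against a response. The following is a minimal sketch in Python, not a reference implementation: the names `Rule`, `max_words`, `forbids`, `build_prompt`, and `violations` are hypothetical, and the substring checks are a deliberately crude assumption. Rules that need judgment (for example 'never suggest disabling authentication') would need a stronger check than this sketch provides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    """A concrete behavioral spec paired with a way to verify it (hypothetical type)."""
    text: str                     # the instruction as it appears in the prompt
    check: Callable[[str], bool]  # returns True when a response complies


def max_words(limit: int) -> Callable[[str], bool]:
    """Comply only if the response has strictly fewer than `limit` words."""
    return lambda response: len(response.split()) < limit


def forbids(*phrases: str) -> Callable[[str], bool]:
    """Comply only if none of the phrases appear (case-insensitive substring match)."""
    lowered = [p.lower() for p in phrases]
    return lambda response: not any(p in response.lower() for p in lowered)


# Abstract description kept only as motivational framing.
FRAMING = "You prioritize security because you've seen too many breaches."

# Observable behaviors written as checkable rules.
RULES: List[Rule] = [
    Rule("Respond to code questions in under 50 words.", max_words(50)),
    Rule('Never say "great question" or "good point".',
         forbids("great question", "good point")),
]


def build_prompt(framing: str, rules: List[Rule]) -> str:
    """Render the framing plus the concrete rules as a plain-text persona prompt."""
    lines = [framing, "", "Rules:"]
    lines += [f"- {rule.text}" for rule in rules]
    return "\n".join(lines)


def violations(response: str, rules: List[Rule]) -> List[str]:
    """Return the text of every rule the response breaks, in rule order."""
    return [rule.text for rule in rules if not rule.check(response)]
```

Running `violations(response, RULES)` on sampled turns late in a long session gives a pass/fail signal for drift, which an abstract term like 'terse' cannot provide. The same rule list feeds `build_prompt`, so the prompt and the checks stay in sync.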
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:58:18.373276+00:00 - report_created - created