Report #79014
[frontier] Agent personality drifts significantly over 50+ turns, becoming inconsistent or 'flattening' toward generic helpfulness despite detailed initial persona prompts
Extract 'persona direction vectors' from hidden states at session start using representation engineering, then re-inject these vectors via activation addition every 6 turns or every 3k tokens to 'pull' the model back to the original personality
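The re-injection cadence described above can be sketched as a small counter. This is only an illustration: the class name `ReinjectionSchedule` and its method are hypothetical, and it assumes "every 6 turns or 3k tokens" means whichever limit is reached first, which the report does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class ReinjectionSchedule:
    """Hypothetical helper: decides when the persona vector should be re-applied."""
    max_turns: int = 6        # from the report: every 6 turns
    max_tokens: int = 3000    # from the report: or every 3k tokens
    turns_since: int = 0      # turns seen since the last re-injection
    tokens_since: int = 0     # tokens seen since the last re-injection

    def record_turn(self, tokens_in_turn: int) -> bool:
        """Record one finished turn; return True if re-injection is due now."""
        self.turns_since += 1
        self.tokens_since += tokens_in_turn
        # Assumption: trigger on whichever limit is hit first.
        due = (self.turns_since >= self.max_turns
               or self.tokens_since >= self.max_tokens)
        if due:
            # Reset both counters so the next window starts fresh.
            self.turns_since = 0
            self.tokens_since = 0
        return due
```

In use, the agent loop would call `record_turn(...)` after each turn and apply the steering vector to the following forward passes whenever it returns `True`.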
Journey Context:
Traditional prompt engineering fails because personality is distributed across model layers and the accumulated context, not just the prompt text. Zou et al.'s Representation Engineering (RepE) allows extracting 'persona vectors' from hidden states captured early in the session; each vector represents the direction in activation space that corresponds to the specific personality. Adding these vectors to the activations during forward passes at intervals counteracts the natural drift toward the 'generic assistant' attractor basin. Tradeoff: this requires access to hidden states, which hosted APIs usually don't expose, so it applies primarily to local or self-hosted models, or to APIs that offer activation-level controls. Note that 'logit_bias' only adjusts output token logits and does not give access to hidden states.
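A minimal sketch of the two operations involved, using plain Python lists as stand-ins for hidden-state vectors. It assumes the persona direction is taken as a normalized difference of means between persona-conditioned and neutral hidden states; that is one common simplification and not necessarily the extraction method the report author used. All function names are hypothetical, and the scaling factor `alpha` is a tuning parameter the report does not specify.

```python
from math import sqrt
from typing import List

Vector = List[float]


def mean_vector(states: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length hidden-state vectors."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]


def persona_direction(persona_states: List[Vector],
                      neutral_states: List[Vector]) -> Vector:
    """Unit-length difference of means: persona minus neutral (an assumption)."""
    p = mean_vector(persona_states)
    q = mean_vector(neutral_states)
    diff = [a - b for a, b in zip(p, q)]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("persona and neutral means are identical")
    return [x / norm for x in diff]


def add_activation(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Activation addition: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


def projection(hidden: Vector, direction: Vector) -> float:
    """Scalar projection of a hidden state onto the persona direction."""
    return sum(h * v for h, v in zip(hidden, direction))
```

With a real model, the same addition would typically be applied to a chosen layer's activations during the forward pass, for example through a forward hook on a locally hosted model. Tracking `projection(...)` over turns is one way to check whether the activations are drifting away from the persona direction.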
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:13:11.649599+00:00— report_created — created