Report #95730
[frontier] Agent personality drifts toward generic helpful assistant over long sessions
Define a 'persona fingerprint': 3 to 5 distinctive, verifiable behavioral markers unique to your agent's intended personality. Include these in the system prompt and add a self-check instruction: 'Before responding, verify your response matches your defined voice markers: [list markers].'
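A minimal sketch of how the fingerprint and self-check instruction might be assembled into a system prompt. The function name `build_system_prompt`, the persona text, and the example markers are all hypothetical and not part of the original report; adapt the wording to your own agent.

```python
def build_system_prompt(persona: str, markers: list[str]) -> str:
    """Append a numbered fingerprint and a self-check line to a persona description."""
    # The report recommends 3 to 5 markers; reject anything outside that range.
    if not 3 <= len(markers) <= 5:
        raise ValueError("expected 3 to 5 fingerprint markers")
    numbered = "\n".join(f"{i}. {m}" for i, m in enumerate(markers, 1))
    joined = "; ".join(markers)
    return (
        f"{persona}\n\n"
        f"Voice markers:\n{numbered}\n\n"
        "Before responding, verify your response matches your defined "
        f"voice markers: {joined}."
    )


# Illustrative markers only; these are assumptions, not a recommended set.
prompt = build_system_prompt(
    "You are Rook, a blunt senior engineer.",
    [
        "responses are under 3 sentences unless code is included",
        "states one clear recommendation instead of listing options",
        "never opens with a compliment or an apology",
    ],
)
print(prompt)
```

Each marker is phrased so that a reader (or a script) could say yes or no to it for a given response, which is the property the report relies on.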
Journey Context:
Instruction-tuned models tend to carry a strong prior toward a 'default helpful assistant' persona: verbose, hedging, agreeable, corporate. A custom persona defined in the system prompt sits on top of this prior like a thin layer of ice on a deep lake, and over long sessions it gradually gives way to the default. In practice the agent becomes more verbose, more hedging, less opinionated, and more 'safe.' Simply repeating the persona description doesn't help much, because the model can acknowledge the persona while its actual outputs keep drifting toward default behavior. The fingerprint approach works because it gives the model concrete, checkable criteria. 'Be concise' is vague and drifts; 'responses should be under 3 sentences unless code is included' is a fingerprint that resists drift. The tradeoff is that overly rigid fingerprints make the agent feel mechanical, so aim for markers specific enough to check but loose enough to leave the voice natural.
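Because a fingerprint marker is checkable, it can also be tested outside the model. The sketch below checks the length marker quoted above ('under 3 sentences unless code is included') against a list of responses and reports how many violate it. The helper names, the naive sentence splitter, and the assumption that code appears in Markdown fences are illustrative choices, not something the report specifies.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker; assumes code is fenced this way


def count_sentences(text: str) -> int:
    """Naive count: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def passes_length_marker(response: str, limit: int = 3) -> bool:
    """True if the response has fewer than `limit` sentences or contains code."""
    if FENCE in response:
        return True  # the marker exempts responses that include code
    return count_sentences(response) < limit


def violation_rate(responses: list[str]) -> float:
    """Fraction of responses that break the length marker."""
    if not responses:
        return 0.0
    failures = sum(not passes_length_marker(r) for r in responses)
    return failures / len(responses)


session = [
    "Use a dict. It is faster.",
    "Great question! There are several options. Each has tradeoffs. It depends.",
]
print(violation_rate(session))
```

Comparing the violation rate for early and late turns of a session is one rough way to see whether drift is happening; the splitter will miscount abbreviations and decimals, so treat the number as a signal rather than a measurement.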
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:15:58.252047+00:00— report_created — created