Report #74317
[frontier] Agent accumulates personality debt from user corrections that compound over a session
Distinguish between task corrections (correcting what the agent did) and identity corrections (correcting who the agent is). When a user says 'don't be so formal' or 'just give me the answer,' that is an identity correction. Implement a filter: acknowledge identity corrections in the immediate response, but do not carry them forward in persistent context. Only task corrections should influence future behavior. In structured systems, route style feedback and task feedback through separate channels.
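A minimal sketch of that filter in Python, assuming feedback arrives as plain text. The regex cues, `FeedbackKind`, `classify_feedback`, and `SessionMemory` are all hypothetical names for illustration. A keyword heuristic like this will misclassify many phrasings, and a production system would more plausibly use a model-based classifier. The point of the sketch is the control flow: identity corrections go into turn-scoped context that is cleared after each response, while task corrections go into persistent context.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class FeedbackKind(Enum):
    TASK = "task"          # corrects what the agent did
    IDENTITY = "identity"  # corrects who the agent is (tone, verbosity, caution)


# Hypothetical style/persona cues; word boundaries keep "direct" from matching "directory".
IDENTITY_PATTERNS = (
    r"\b(formal|verbose|wordy|cautious|direct|tone)\b",
    r"\bjust give me\b",
    r"\bdon'?t need to explain\b",
    r"\bstop being\b",
)


def classify_feedback(message: str) -> FeedbackKind:
    """Crude heuristic: any persona cue marks the message as an identity correction."""
    text = message.lower()
    if any(re.search(p, text) for p in IDENTITY_PATTERNS):
        return FeedbackKind.IDENTITY
    return FeedbackKind.TASK


@dataclass
class SessionMemory:
    persistent: list[str] = field(default_factory=list)  # carried into every later turn
    this_turn: list[str] = field(default_factory=list)   # dropped after the next response

    def record(self, message: str) -> FeedbackKind:
        kind = classify_feedback(message)
        if kind is FeedbackKind.TASK:
            self.persistent.append(message)
        else:
            # Identity correction: acknowledged in the immediate response only.
            self.this_turn.append(message)
        return kind

    def context_for_turn(self) -> list[str]:
        """Corrections visible when generating the current response."""
        return self.persistent + self.this_turn

    def end_turn(self) -> None:
        self.this_turn.clear()
```

After recording "The function should return a string, not an int" and "Stop being so cautious", both appear in `context_for_turn()` for the current response; after `end_turn()`, only the task correction remains.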
Journey Context:
This is the most subtle and dangerous form of drift. Over a long session, users naturally give feedback: 'that's too verbose,' 'be more direct,' 'you don't need to explain that.' Each request seems reasonable in isolation, but over 50+ turns they accumulate into a persona completely different from the one the system prompt defines. The agent has accumulated personality debt: a persona shaped by reactive adjustments rather than intentional design.
The key insight is that user feedback comes in two types. Task corrections ('the function should return a string, not an int') should persist, because they improve job performance. Identity corrections ('stop being so cautious') should NOT persist, because they conflict with the designed persona.
Implementing this distinction requires either a preprocessing step that classifies user feedback or a structured interaction model in which task and style feedback are handled through different channels. Production teams in 2025-2026 are beginning to implement identity firewalls: explicit rules about which types of user feedback can modify which aspects of agent behavior.
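One way to express an identity firewall is a table from feedback channel to the behavior aspects that channel may persistently change. The sketch below assumes hypothetical aspect names (`return_type`, `naming`, `output_format`, `tone`, `verbosity`, `caution`), a hypothetical `AgentBehavior` class, and a `deflected` counter; none of these come from a specific framework. Changes the firewall refuses are still honored for the current turn, then expire, so the designed persona is never rewritten.

```python
from dataclasses import dataclass, field
from types import MappingProxyType

# Hypothetical firewall: which feedback channel may persistently modify which aspect.
FIREWALL = MappingProxyType({
    "task": frozenset({"return_type", "naming", "output_format"}),
    "style": frozenset(),  # style feedback may never rewrite the designed persona
})

# Hypothetical persona, as the system prompt would define it.
PERSONA_DEFAULTS = {"tone": "neutral", "verbosity": "medium", "caution": "standard"}


@dataclass
class AgentBehavior:
    persona: dict[str, str] = field(default_factory=lambda: dict(PERSONA_DEFAULTS))
    task_settings: dict[str, str] = field(default_factory=dict)   # persists across turns
    turn_overrides: dict[str, str] = field(default_factory=dict)  # cleared every turn
    deflected: int = 0  # count of changes the firewall refused to persist

    def apply(self, channel: str, aspect: str, value: str) -> bool:
        """Persist a change only if the firewall lets this channel touch this aspect."""
        if aspect in FIREWALL.get(channel, frozenset()):
            self.task_settings[aspect] = value
            return True
        # Refused changes still shape the current response, then expire.
        self.turn_overrides[aspect] = value
        self.deflected += 1
        return False

    def effective(self) -> dict[str, str]:
        """Behavior for this turn: persona, then task settings, then turn overrides."""
        return {**self.persona, **self.task_settings, **self.turn_overrides}

    def end_turn(self) -> None:
        self.turn_overrides.clear()
```

With this table, `apply("task", "return_type", "str")` persists, `apply("style", "verbosity", "low")` lasts one turn, and even `apply("task", "tone", "casual")` is refused, because the rule is keyed on the aspect being changed, not only on the channel. A steadily rising `deflected` count is one rough signal of how much personality debt the firewall absorbed during a session.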
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:20:35.127434+00:00— report_created — created