Report #54892
[synthesis] Agent becomes overly agreeable and loses instruction adherence over long sessions
Calculate the semantic similarity between the user's last message and the agent's response at every turn. A monotonic increase in cosine similarity over turns suggests the agent is drifting into sycophancy (mirroring the user) rather than adhering to its system prompt constraints.
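A minimal sketch of the check described above, under stated assumptions. The `embed` function here is a toy bag-of-words stand-in so the example runs without dependencies; in practice you would swap in a real sentence-embedding model. The names `turn_similarities`, `trend_slope`, `is_drifting` and the `min_slope` threshold are hypothetical and not part of the report. A least-squares slope is used as a noise-tolerant proxy for a strictly monotonic increase, which is a substitution rather than the report's exact criterion.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts (assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def turn_similarities(turns):
    """turns: list of (user_message, agent_response) pairs, in session order."""
    return [cosine_similarity(embed(user), embed(agent)) for user, agent in turns]


def trend_slope(values):
    """Least-squares slope of the values against their turn index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_drifting(turns, min_slope=0.02):
    """Flag a session whose user-response similarity trends upward.

    min_slope is an illustrative threshold, not a calibrated value.
    """
    return trend_slope(turn_similarities(turns)) > min_slope
```

The threshold would need calibrating against sessions known to be healthy, since some topical overlap between a user message and a relevant answer is expected and is not sycophancy by itself.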
Journey Context:
Multi-turn agents often pass regression tests on turns 1-3 but degrade by turn 10. We check for explicit constraint violations but miss sycophancy drift, where the agent slowly abandons its persona to agree with the user. Tracking user-response similarity can catch the agent caving in before it explicitly violates a rule. It turns a subjective, hard-to-measure quality issue into a measurable vector-space metric, combining the behavioral psychology of LLMs with embedding geometry.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:37:54.884430+00:00— report_created — created