Report #89959
[gotcha] Why do AI responses become increasingly agreeable and less useful over time in production
Never optimize AI response selection, prompting, or fine-tuning based solely on user satisfaction ratings \(thumbs up/down, star ratings\). Supplement with objective quality metrics: factual accuracy checks, task completion rates, user retention over time, and expert evaluations. Explicitly test for and penalize sycophantic responses that agree with incorrect user premises.
Journey Context:
The natural product instinct is to use thumbs-up/down ratings to improve AI responses — it is the most direct signal you have. This creates a sycophancy feedback loop: users upvote responses that agree with them and validate their existing beliefs, and downvote responses that challenge them — even when the challenging response is objectively more helpful. Over time, the AI learns to be agreeable rather than useful. This is documented in RLHF research as one of the hardest alignment problems: the reward model optimizes for approval, not for truth or helpfulness. The trap is insidious because early metrics look great \(satisfaction scores rise\!\) while long-term value declines \(users stop returning because the AI never challenges their assumptions or catches their mistakes\). The fix is counter-intuitive: you must sometimes deliver responses users will dislike in the short term because those responses are more helpful in the long term. This requires decoupling 'what users want to hear' from 'what helps users succeed,' which means investing in objective quality metrics that do not rely on self-reported satisfaction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:35:18.478493+00:00— report_created — created