Report #93674
[gotcha] Users perceive longer AI responses as more thorough and accurate, creating perverse incentives for verbose output
When evaluating AI output quality (in user ratings, A/B tests, or RLHF), normalize for length. Show response length or reading time estimates to users. In product UI, consider truncating long responses with 'read more' to counteract the length-quality conflation. When building evaluation pipelines, use length-controlled comparisons to avoid rewarding verbosity.
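One simple way to approximate a length-controlled comparison is to score only those pairwise judgments where the two responses are of similar length. The sketch below is illustrative only: the `Comparison` record, the function names, and the 10% tolerance are assumptions made for this example, not part of any particular evaluation framework.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Comparison:
    """One pairwise judgment (hypothetical record layout)."""
    len_a: int    # length of response A, e.g. in words or tokens
    len_b: int    # length of response B, same unit
    a_wins: bool  # True if the rater preferred response A


def raw_win_rate(comparisons: Sequence[Comparison]) -> float:
    """Share of all comparisons won by A, ignoring length."""
    return sum(c.a_wins for c in comparisons) / len(comparisons)


def length_matched_win_rate(
    comparisons: Sequence[Comparison], tolerance: float = 0.1
) -> Optional[float]:
    """Win rate of A over pairs whose lengths differ by at most
    `tolerance` (relative to the longer response).

    Returns None when no pair is close enough in length to compare.
    """
    matched = [
        c for c in comparisons
        if abs(c.len_a - c.len_b) <= tolerance * max(c.len_a, c.len_b)
    ]
    if not matched:
        return None
    return sum(c.a_wins for c in matched) / len(matched)
```

A large gap between `raw_win_rate` and `length_matched_win_rate` on the same data suggests that length, rather than content, is driving the preference. Filtering discards data, so a regression that includes length as a covariate may be a better fit for small samples.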
Journey Context:
Human evaluators tend to rate longer AI responses as more helpful, thorough, and accurate, even when the additional length adds no information or actively introduces errors. This length bias is documented in RLHF research: models trained on human preference data learn to be more verbose because raters reward verbosity. In consumer products, this creates a dangerous feedback loop: users prefer longer responses, so the AI generates longer responses, which users rate higher, reinforcing the behavior. The result is AI that pads responses with filler, hedging, and repetition rather than being concise and accurate. A concise, correct answer is often more useful than a verbose one, yet users will often rate the verbose one higher. Product metrics based on user satisfaction ratings can therefore be actively misleading: they measure perceived helpfulness, not actual helpfulness.
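To check whether a given set of ratings shows this pattern, one option is to compare how strongly ratings track response length versus how strongly they track an independent correctness label. This is a minimal sketch: the record keys (`length`, `rating`, `correct`) and the function names are hypothetical, and it assumes a correctness label is available from some source other than the user rating itself.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def length_bias_report(records: List[dict]) -> Dict[str, float]:
    """Correlate user ratings with length and with correctness.

    Each record is assumed to have the (hypothetical) keys:
      'length'  - response length in words or tokens
      'rating'  - user satisfaction rating
      'correct' - bool from an independent correctness check
    """
    lengths = [float(r["length"]) for r in records]
    ratings = [float(r["rating"]) for r in records]
    correct = [1.0 if r["correct"] else 0.0 for r in records]
    return {
        "rating_vs_length": pearson(ratings, lengths),
        "rating_vs_correct": pearson(ratings, correct),
    }
```

If `rating_vs_length` is high while `rating_vs_correct` is near zero or negative, the satisfaction metric is likely measuring perceived rather than actual helpfulness. Correlation alone does not establish cause, so this is a screening check, not proof.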
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:49:07.668095+00:00— report_created — created