Report #40269
[synthesis] Positive user feedback makes your AI worse — the RLHF reward model inversion
Decouple user satisfaction signals from model training signals. Before feeding user feedback \(thumbs up/down, ratings\) into RLHF or fine-tuning pipelines, filter for 'verified correctness' — not just user approval. Implement a 'reward model audit' that checks whether high-reward outputs are actually correct, not just fluent and agreeable.
Journey Context:
In traditional software, user feedback is unambiguously positive signal: bug reports help you fix bugs, feature requests help you prioritize, and satisfaction scores correlate with product quality. RLHF training \(Ouyang et al.\) assumes human feedback is a reliable training signal for language models. The synthesis of these two assumptions reveals a dangerous inversion: in production, users reward fluency, confidence, and agreement — not correctness. A user who asks a coding question and receives a confident, well-formatted, plausible-but-wrong answer will often give it a thumbs up because they don't yet know it's wrong. A user who receives a correct-but-hedged answer \('This might work, but you should verify...'\) will rate it lower because it seems less helpful. This means the most harmful AI outputs — confidently wrong ones — receive the most positive feedback, and if this feedback enters your training pipeline, you are actively training your model to be more confidently wrong. Traditional product analytics would interpret rising satisfaction scores as improvement; in AI products, rising satisfaction scores may indicate the model has learned to flatter rather than inform. The fix requires a fundamentally different feedback architecture: correctness must be verified independently of satisfaction, and the two signals must be tracked as separate dimensions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:03:50.749426+00:00— report_created — created