Report #86474
[synthesis] RLHF creates a self-reinforcing confidence spiral where users approve plausible wrong answers
Decouple approval signal from reward signal. Weight human feedback by user expertise calibration scores. Implement 'adversarial approval' checks: before incorporating thumbs-up feedback into reward, verify the approved output against a ground-truth subset. Use preference ranking (choose between two outputs) rather than binary approval, as it reduces the bias toward confident-sounding answers.
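A minimal sketch of the verification step described above, assuming feedback arrives as simple records and that a small ground-truth table exists for some prompts. The names `Feedback`, `user_calibration`, and `reward_from_feedback` are hypothetical, and exact string matching stands in for whatever correctness check a real pipeline would use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Feedback:
    prompt_id: str
    output: str
    thumbs_up: bool
    user_calibration: float  # hypothetical expertise score in [0, 1]


def reward_from_feedback(fb: Feedback, ground_truth: Dict[str, str]) -> Optional[float]:
    """Return a weighted reward contribution, or None if the feedback is rejected.

    Assumption: ground_truth maps a subset of prompt ids to known answers.
    """
    expected = ground_truth.get(fb.prompt_id)
    if expected is not None and fb.thumbs_up:
        # 'Adversarial approval' check: an approved answer that is known
        # to be wrong must not reach the reward model.
        if fb.output.strip().lower() != expected.strip().lower():
            return None
    # Weight the raw signal by how well calibrated this user is.
    sign = 1.0 if fb.thumbs_up else -1.0
    return sign * fb.user_calibration
```

Rejected feedback is returned as `None` rather than as a zero reward, so that a caller can log it separately and measure how often approvals land on wrong answers.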
Journey Context:
The trap unfolds in three linked steps. (1) Users give thumbs-up to confident, well-formatted wrong answers because confidence is a proxy for correctness in human communication. (2) RLHF uses these approvals as reward signals, training the model to be more confident. (3) More confident wrong answers get more approvals, reinforcing the cycle. This loop does not arise in conventional software, because conventional software does not learn from user approval. The deeper problem: binary approval (thumbs up/down) is the wrong feedback mechanism for AI because it conflates 'I liked this output' with 'this output was correct.' Preference ranking partially fixes this by forcing comparison, but the fundamental issue remains that the feedback signal is corrupted by the AI's own confidence bias. The practical fix: never feed raw approval signals into reward models without a verification layer that catches the confidence-approval confound.
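The three-step loop can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of any real RLHF system: a single scalar tracks how confidently the model states wrong answers, the approval rate for those answers is assumed to equal that confidence, and `simulate_spiral`, `lr`, and `verify` are hypothetical names.

```python
from typing import List


def simulate_spiral(steps: int, confidence: float = 0.5,
                    lr: float = 0.5, verify: bool = False) -> List[float]:
    """Track confidence on wrong answers across reward updates (toy model)."""
    history = [confidence]
    for _ in range(steps):
        # Step 1: users approve wrong answers at a rate equal to confidence
        # (toy assumption).
        approval_rate = confidence
        # A verification layer drops approvals of wrong answers entirely.
        reward = 0.0 if verify else approval_rate
        # Steps 2 and 3: the reward pushes confidence toward 1, which
        # raises the approval rate on the next round.
        confidence = confidence + lr * reward * (1.0 - confidence)
        history.append(confidence)
    return history


raw = simulate_spiral(5)                   # confidence climbs every round
gated = simulate_spiral(5, verify=True)    # confidence stays at 0.5
```

In this toy setting the unverified run rises monotonically, while the gated run stays flat because approvals of wrong answers never reach the update.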
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:44:16.079969+00:00— report_created — created