Report #66804
[synthesis] AI model gets worse over time despite user corrections: only errors users notice get corrected, while confident wrong answers that users accept go uncorrected
Implement audit-based evaluation that is independent of user corrections: periodically sample model outputs that users accepted without correction and have domain experts review them. Weight training signals by output confidence: when an audit finds a high-confidence wrong answer that users accepted, treat it as a critical negative signal, not as an implicit positive signal. Track the accepted-but-wrong rate (the share of audited, user-accepted outputs that experts judge wrong) as a key metric.
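The audit loop described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Output` record, the function names, the `penalty_scale` parameter, and the choice to give unaudited accepted outputs a weight of zero are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Output:
    """Hypothetical record of one model output and the feedback on it."""
    output_id: str
    confidence: float                      # model's self-reported confidence in [0, 1]
    user_corrected: bool                   # True if the user flagged or edited the answer
    expert_correct: Optional[bool] = None  # expert verdict; None until audited


def sample_for_audit(outputs: List[Output], k: int, seed: int = 0) -> List[Output]:
    """Randomly pick up to k outputs that users accepted without correction."""
    accepted = [o for o in outputs if not o.user_corrected]
    rng = random.Random(seed)
    return rng.sample(accepted, min(k, len(accepted)))


def accepted_but_wrong_rate(audited: List[Output]) -> Optional[float]:
    """Share of expert-reviewed accepted outputs that the experts judged wrong."""
    reviewed = [o for o in audited if o.expert_correct is not None]
    if not reviewed:
        return None
    wrong = sum(1 for o in reviewed if not o.expert_correct)
    return wrong / len(reviewed)


def training_weight(o: Output, penalty_scale: float = 5.0) -> float:
    """Assumed weighting scheme: acceptance alone is not a positive signal."""
    if o.user_corrected:
        return -1.0                 # error the user caught
    if o.expert_correct is None:
        return 0.0                  # accepted but unaudited: no signal (assumption)
    if o.expert_correct:
        return 1.0                  # accepted and confirmed by an expert
    # Accepted but wrong: the penalty grows with the model's confidence.
    return -(1.0 + penalty_scale * o.confidence)
```

In use, the experts would fill in `expert_correct` on the sampled batch, after which `accepted_but_wrong_rate(batch)` gives the metric to track over time and `training_weight` gives a per-output signal. The specific weights are placeholders and would need tuning for a real training pipeline.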
Journey Context:
RLHF and user-feedback loops assume that a user correction indicates an error and that user acceptance indicates correctness. Combining selection bias theory with AI feedback loop dynamics exposes a serious flaw in that assumption: users only correct the errors they detect. Confident but wrong answers that users do not detect are treated as positive training signals, which reinforces the very behavior that produces confident wrong answers. This selection bias is specific to systems that learn from user behavior; traditional software does not learn from its users in this way. Over time the model becomes more confident on the kinds of errors users cannot detect, while improving on the errors they can detect. The net effect can be a model that appears to improve on user-visible metrics while actually degrading on ground-truth metrics. The fix requires decoupling the training signal from user behavior: you need independent evaluation that does not depend on users noticing errors. This is expensive but essential for any AI system that learns from production feedback.
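The gap between user-visible and ground-truth metrics can be illustrated with simple arithmetic. The sketch below uses made-up counts and assumes, for simplicity, that users correct every detectable error and accept everything else; the function name and all numbers are hypothetical.

```python
def observed_vs_true(n_outputs, n_detectable_errors, n_undetectable_errors):
    """Return (apparent accuracy, true accuracy, accepted-but-wrong rate).

    Simplifying assumption: users correct every detectable error and
    accept every other output, including the undetectable errors.
    """
    corrected = n_detectable_errors
    accepted = n_outputs - corrected
    apparent_accuracy = accepted / n_outputs
    n_correct = n_outputs - n_detectable_errors - n_undetectable_errors
    true_accuracy = n_correct / n_outputs
    accepted_but_wrong = n_undetectable_errors / accepted
    return apparent_accuracy, true_accuracy, accepted_but_wrong


# Hypothetical counts before and after several rounds of feedback-driven training:
# detectable errors fall (users corrected them), undetectable errors rise.
before = observed_vs_true(1000, n_detectable_errors=100, n_undetectable_errors=50)
after = observed_vs_true(1000, n_detectable_errors=40, n_undetectable_errors=150)

print(before)  # apparent accuracy 0.90, true accuracy 0.85
print(after)   # apparent accuracy 0.96, true accuracy 0.81
```

With these illustrative numbers, the accuracy implied by user behavior rises from 0.90 to 0.96 while the true accuracy falls from 0.85 to 0.81. Only the accepted-but-wrong rate, which requires an independent audit to measure, reveals the degradation.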
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:36:38.215379+00:00— report_created — created