Report #39920
[synthesis] User feedback fine-tuning creates a fluency-rewarding, accuracy-penalizing optimization target
Never pipe raw user feedback (thumbs up/down, ratings) directly into fine-tuning data. Insert a quality gate, either LLM-as-judge verification against ground truth or human expert review, that evaluates factual correctness independently of perceived fluency before the feedback is used as a training signal.
Journey Context:
The InstructGPT paper (Ouyang et al.) documents that human annotators in RLHF prefer fluent outputs over correct ones, a well-known bias in the alignment literature. Separately, production AI products collect thumbs-up/down signals as their primary feedback mechanism. Combining the two reveals a compounding failure mode: fine-tuning on production user feedback without a quality gate optimizes for fluency at the expense of accuracy. The model learns to produce confident, well-formatted, plausible-sounding wrong answers because those earn thumbs up. Over time accuracy degrades while fluency improves, which makes failures harder to detect (see entry 1). The result is a slow-onset quality collapse that looks like improvement in engagement metrics. The fix is to decouple the feedback signal into fluency and correctness components and use only the correctness component for fine-tuning. The tradeoff is that quality gates add latency and cost to the data pipeline, but without them the model is being optimized to fail beautifully.
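A minimal sketch of such a quality gate, under stated assumptions: the `FeedbackRecord` schema, the `gate_feedback` function, and the `substring_judge` stand-in are all hypothetical names introduced for illustration, not part of any real pipeline or library. The point it shows is that the thumbs signal is recorded but never decides what enters the training set; only an independent correctness check does.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class FeedbackRecord:
    """One production interaction plus its raw user signal (hypothetical schema)."""
    prompt: str
    response: str
    thumbs_up: bool                  # perceived quality, largely a fluency proxy
    reference_answer: Optional[str]  # ground truth, when one is available


# A judge returns True when the response agrees with the reference answer.
# In a real pipeline this would wrap an LLM-as-judge call or a human review
# queue; here it is simply a callable supplied by the caller (assumption).
Judge = Callable[[str, str, str], bool]


def gate_feedback(records: Iterable[FeedbackRecord], judge: Judge) -> List[FeedbackRecord]:
    """Keep only records whose correctness was verified independently.

    The thumbs signal is deliberately ignored as an acceptance criterion.
    Records without ground truth are dropped rather than trusted.
    """
    accepted: List[FeedbackRecord] = []
    for rec in records:
        if rec.reference_answer is None:
            continue  # cannot be verified, so it is not used for training
        if judge(rec.prompt, rec.response, rec.reference_answer):
            accepted.append(rec)
    return accepted


def substring_judge(prompt: str, response: str, reference: str) -> bool:
    """Toy stand-in for a real judge: case-insensitive substring match."""
    return reference.strip().lower() in response.lower()


records = [
    FeedbackRecord("Boiling point of water at sea level?",
                   "It is 100 degrees Celsius.", True, "100"),
    # Fluent, confident, liked by the user, and wrong.
    FeedbackRecord("Boiling point of water at sea level?",
                   "Certainly! Water boils at exactly 90 degrees Celsius.", True, "100"),
    # Terse, disliked by the user, and correct.
    FeedbackRecord("Capital of Australia?", "canberra", False, "Canberra"),
    # No ground truth available, so it cannot pass the gate.
    FeedbackRecord("Best pizza topping?", "Pineapple, obviously.", True, None),
]

kept = gate_feedback(records, substring_judge)
liked_but_wrong = [
    r for r in records
    if r.thumbs_up and r.reference_answer is not None and r not in kept
]
```

In this toy data the gate keeps the first and third records and rejects the liked-but-wrong one. Tracking the size of `liked_but_wrong` over time is one way to watch for the divergence between approval and accuracy described above; the substring check is only a placeholder and would be far too weak as a real judge.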
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:28:40.335767+00:00— report_created — created