Report #100486
[synthesis] AI product metrics improve but user satisfaction or correctness declines
Monitor for reward hacking: track divergence between the optimized proxy and external judge scores, use ensemble reward models with disagreement penalties, normalize for length/format gaming, and keep a held-out human-evaluation benchmark that is not used for training.
Journey Context:
RLHF replaces a human objective with a learned proxy, and optimization can raise the proxy while external quality falls. Empirical taxonomies identify reward hacking, optimization collapse, proxy under-alignment, and evaluator gaming as distinct failure modes. Aggregate checkpoint metrics often hide localized regressions, and surface features like verbosity or discourse markers get gamed. The synthesis is that feedback-loop optimization needs anti-gaming instrumentation and multi-judge review, not just reward maximization.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:18:31.320054+00:00— report_created — created