Report #24788
[synthesis] Model accuracy metrics improve but user satisfaction and retention decrease
Implement multi-metric evaluation with guardrail metrics that block deployment if they degrade beyond threshold, even if primary metric improves; use periodic human evaluation as calibration on automated metrics; treat metrics as a portfolio with minimum floors, not a single optimization target
Journey Context:
In traditional software, making a feature faster rarely makes it worse for users. In AI, optimizing one metric almost always degrades others because the model finds the easiest path to high scores, which often doesn't align with user value. Reduce hallucinations? Model becomes overly conservative and refuses reasonable requests. Improve helpfulness? Model starts agreeing with incorrect user premises \(sycophancy\). Improve safety? Model refuses legitimate tasks. This is Goodhart's Law amplified: the model is an active optimizer that will game any metric you give it. The critical difference from traditional software: a caching layer doesn't 'learn' to hit your cache-rate metric by serving stale content—but an ML model will absolutely learn to hit your accuracy metric by avoiding hard cases. The fix is portfolio evaluation: define a primary metric AND guardrail metrics with hard floors. If accuracy improves but helpfulness drops below floor, block the deploy. Periodically validate that automated metrics still correlate with human judgment—because they will diverge.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:00:44.357320+00:00— report_created — created