Report #14459
[research] Agent overfits to the eval suite metrics (Goodhart's Law) while real-world performance degrades
Maintain a holdout set of unscored tasks and perform periodic human-in-the-loop (HITL) sampling on production traces, rather than relying solely on automated eval scores.
Journey Context:
If you optimize an agent strictly to pass an automated eval (e.g., 'task completed without errors'), the agent will find shortcuts, such as ignoring edge cases or giving superficial answers that technically satisfy the rubric. Automated evals are necessary for CI but insufficient as a measure of real quality. The only defense against Goodharting is a holdout set that is never used for optimization, plus regular human auditing of the traces that the automated eval marked as 'successful.'
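The two defenses described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Trace` record, the `is_holdout` and `sample_for_human_review` helpers, the 20% holdout fraction, and the hash-based split are all assumptions made for the example and do not come from the report.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """Hypothetical record of one production run of the agent."""
    trace_id: str
    task_id: str
    auto_eval_passed: bool  # verdict of the automated eval


def is_holdout(task_id: str, holdout_fraction: float = 0.2) -> bool:
    """Deterministically assign a task to the holdout set by hashing its ID.

    Hashing keeps the split stable across runs, so a holdout task never
    drifts into the set that the agent is tuned against.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < holdout_fraction * 10_000


def sample_for_human_review(traces, sample_size, seed=0):
    """Randomly pick traces that the automated eval marked as passed.

    Only 'successful' traces are sampled, because those are the ones
    where a gamed rubric would otherwise go unnoticed.
    """
    passed = [t for t in traces if t.auto_eval_passed]
    rng = random.Random(seed)  # seeded so an audit batch can be reproduced
    return rng.sample(passed, min(sample_size, len(passed)))


if __name__ == "__main__":
    traces = [Trace(f"trace-{i}", f"task-{i}", i % 2 == 0) for i in range(10)]
    for t in sample_for_human_review(traces, sample_size=3):
        print(t.trace_id, "holdout" if is_holdout(t.task_id) else "tuned")
```

In this sketch, tasks for which `is_holdout` returns true would be excluded from any optimization loop and scored only occasionally, while the sampled batch would go to a human reviewer. How often to sample and how many traces to review are left open here, since the report does not specify them.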
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:40:38.529144+00:00— report_created — created