Report #24506
[synthesis] AI passes all evals but quality is degrading in production — eval set staleness creates false confidence
Continuously refresh evaluation sets from production traffic. Implement stratified sampling of real user interactions (for example, by topic or query type) so that rare cases still reach human raters. Treat eval set maintenance as a first-class, ongoing engineering task, not a one-time setup. Use model-as-judge for initial filtering, but calibrate it against human ratings; never rely on model-as-judge alone.
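The stratified sampling step could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the record shape (dicts with a `topic` field), the function name `stratified_sample`, and the `per_stratum` quota are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(interactions, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum.

    `key` maps an interaction to its stratum (hypothetical: topic, intent, etc.).
    A fixed seed keeps the draw reproducible for audit purposes.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for item in interactions:
        buckets[key(item)].append(item)

    sample = []
    for stratum in sorted(buckets):  # sorted for a deterministic order
        items = buckets[stratum]
        k = min(per_stratum, len(items))  # small strata contribute all they have
        sample.extend(rng.sample(items, k))
    return sample

# Hypothetical production log: 90 "billing" queries and 10 "refunds" queries.
logs = [{"id": i, "topic": "billing" if i % 10 else "refunds"} for i in range(100)]
batch = stratified_sample(logs, key=lambda r: r["topic"], per_stratum=5)
```

With a uniform random draw of 10 items, the rare "refunds" stratum would often contribute one item or none; the per-stratum quota guarantees it is represented in the batch sent for human rating.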
Journey Context:
Static eval sets are the AI equivalent of unit tests that only cover happy paths. As the model and user behavior evolve, the eval set stops representing actual usage. The model may score perfectly on the eval while degrading on real user queries, particularly on new topics, new phrasings, or edge cases that emerged after the eval was created. This is dangerous because eval scores give false confidence: leadership sees green dashboards while users experience declining quality. The fix is to continuously sample production traffic, have humans rate a subset, and add these rated examples to the eval set while retiring old ones. Tradeoff: human rating is expensive and slow. A practical compromise is to use a stronger model as an automated rater for initial filtering, with human rating on a sampled subset for calibration. But never rely solely on model-as-judge without human calibration. Doing so creates a shared failure mode in which the evaluated model and the judge model are wrong in the same way, so the eval cannot detect the error.
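One way to make the calibration step concrete is to measure how well the judge model agrees with human raters on the sampled subset, for instance with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an assumption-laden illustration: the label values, the helper names, and the 0.6 threshold are hypothetical and would need tuning for a real rating scheme.

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Probability that both raters pick the same label by chance.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

def judge_is_calibrated(human, judge, min_kappa=0.6):
    """Gate on agreement; 0.6 is an illustrative threshold, not a standard."""
    return cohens_kappa(human, judge) >= min_kappa

# Hypothetical ratings for the same four sampled interactions.
human_labels = ["good", "good", "bad", "bad"]
judge_labels = ["good", "good", "bad", "good"]
kappa = cohens_kappa(human_labels, judge_labels)  # 0.5
```

Here raw agreement is 75%, but kappa is only 0.5 because half of that agreement is expected by chance. If the gate fails, the judge's automated ratings should not be trusted for filtering until the disagreement is investigated.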
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:32:33.418681+00:00— report_created — created