Report #22767
[research] Static golden eval datasets become stale and fail to catch real-world edge cases
Implement production-in-the-loop dataset curation. Automatically promote failed production traces (where the user explicitly corrected the agent or gave a thumbs down) into the regression eval suite after human review.
Journey Context:
Manually curated eval sets age quickly as user behavior and expectations evolve. An agent can pass 100% of its static tests and still fail on new types of queries seen in production. Routing low-confidence or user-rejected production traces back into the eval suite, after human review, creates a continuously adapting test bed that reflects actual failure modes.
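A minimal sketch of the promotion step, to make the flow concrete. Everything here is hypothetical and not taken from any particular tracing or eval framework: the `Trace` and `EvalCase` shapes, the `is_candidate` and `promote_traces` helpers, the 0.5 confidence threshold, and the `review` callback, which stands in for the human review step and returns the approved expected output, or `None` to reject the trace.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumed cutoff for "low confidence"; tune it for your own agent.
CONFIDENCE_THRESHOLD = 0.5


@dataclass
class Trace:
    """One production interaction (hypothetical shape)."""
    trace_id: str
    user_input: str
    agent_output: str
    thumbs_down: bool = False
    user_correction: Optional[str] = None
    confidence: Optional[float] = None


@dataclass
class EvalCase:
    """One regression case in the eval suite (hypothetical shape)."""
    case_id: str
    user_input: str
    expected_output: str
    source_trace_id: str


def is_candidate(trace: Trace, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """True if the trace shows a failure signal worth sending to review."""
    if trace.thumbs_down or trace.user_correction is not None:
        return True
    return trace.confidence is not None and trace.confidence < threshold


def promote_traces(
    traces: List[Trace],
    suite: List[EvalCase],
    review: Callable[[Trace], Optional[str]],
) -> List[EvalCase]:
    """Append reviewer-approved failed traces to the suite; return the new cases."""
    # Skip traces that were already promoted in an earlier run.
    seen = {case.source_trace_id for case in suite}
    added: List[EvalCase] = []
    for trace in traces:
        if trace.trace_id in seen or not is_candidate(trace):
            continue
        expected = review(trace)  # human review gate
        if expected is None:
            continue  # reviewer rejected the trace
        case = EvalCase(
            case_id=f"prod-{trace.trace_id}",
            user_input=trace.user_input,
            expected_output=expected,
            source_trace_id=trace.trace_id,
        )
        suite.append(case)
        seen.add(trace.trace_id)
        added.append(case)
    return added
```

In this sketch the reviewer, not the raw user feedback, supplies the expected output, so a thumbs down alone never becomes a test case without a human deciding what the correct answer should have been. Tracking `source_trace_id` keeps repeated runs from adding the same trace twice.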
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:37:14.621593+00:00— report_created — created