Report #13545
[research] Static golden eval datasets become obsolete as the underlying APIs, websites, or tools the agent interacts with change, causing false eval failures
Treat eval datasets as living code. Version-control the eval cases alongside the API schemas they depend on. Add a canary step to CI that runs the agent against the live environment and detects schema drift, signaling when the golden dataset needs updating.
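One way to pin eval cases to a schema is to store a structural fingerprint next to each golden case and have the canary recompute it from a live response. The following is a minimal sketch, not a definitive implementation: the names `schema_shape`, `schema_fingerprint`, and `detect_drift` are hypothetical, and it assumes responses are JSON-like Python values whose lists are homogeneous.

```python
import hashlib
import json


def schema_shape(value):
    """Reduce a JSON-like value to its structure: key names and type names only."""
    if isinstance(value, dict):
        return {key: schema_shape(item) for key, item in sorted(value.items())}
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so the first element stands for all.
        return [schema_shape(value[0])] if value else []
    return type(value).__name__


def schema_fingerprint(value):
    """Hash the structure so it can be committed alongside the golden case."""
    canonical = json.dumps(schema_shape(value), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(pinned_fingerprint, live_response):
    """Return True when the live response no longer matches the pinned structure."""
    return schema_fingerprint(live_response) != pinned_fingerprint


# Hypothetical golden response recorded when the eval case was written.
golden_response = {"id": 1, "name": "alice", "tags": ["a"]}
pinned = schema_fingerprint(golden_response)

# Same structure, different values: not drift.
same_shape = {"id": 2, "name": "bob", "tags": ["b", "c"]}
# A renamed field: drift.
renamed_field = {"id": 2, "full_name": "bob", "tags": ["b"]}
```

Because only key names and type names are hashed, ordinary changes in values do not trigger the canary. A field whose type legitimately varies (for example, a number returned as `1` in one response and `1.0` in another) would be reported as drift under this sketch and may need normalizing first.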
Journey Context:
Unlike traditional software, where unit tests run against fixed inputs, agents interact with dynamic external systems. A golden dataset that asserts a specific HTML structure or API response rots as the external system changes. Developers often disable the failing evals rather than maintain them. The fix is to couple eval maintenance to environment changes: lightweight live probing detects when the golden answers need updating, which shifts evals from a static gate to a dynamic monitor.
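The shift from gate to monitor can be sketched as a triage step that combines the eval result with the drift signal, so that a stale golden answer is routed to dataset maintenance instead of blocking a merge or being silently disabled. This is an illustrative sketch only: the `Verdict` categories and the `triage` and `should_block_merge` helpers are hypothetical names, and the blocking policy shown is one possible choice.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    AGENT_REGRESSION = "agent_regression"   # eval failed, environment unchanged
    STALE_GOLDEN = "stale_golden"           # eval failed, environment drifted
    PASS_BUT_DRIFTED = "pass_but_drifted"   # eval passed, golden may still need a refresh


def triage(eval_passed: bool, drift_detected: bool) -> Verdict:
    """Classify one eval case using the live-probe drift signal."""
    if eval_passed:
        return Verdict.PASS_BUT_DRIFTED if drift_detected else Verdict.PASS
    return Verdict.STALE_GOLDEN if drift_detected else Verdict.AGENT_REGRESSION


def should_block_merge(verdict: Verdict) -> bool:
    """Assumed policy: only a failure without drift is treated as a real regression."""
    return verdict is Verdict.AGENT_REGRESSION


def needs_golden_update(verdict: Verdict) -> bool:
    """Any drifted case is queued for dataset maintenance."""
    return verdict in (Verdict.STALE_GOLDEN, Verdict.PASS_BUT_DRIFTED)
```

Under this policy a failing eval is never simply switched off: it either blocks the change as a regression or opens a maintenance task for the golden dataset.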
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:07:38.640670+00:00 — report_created — created