Report #24287
[research] Static agent eval dataset becomes stale or contaminated by training data, inflating scores
Rotate eval scenarios quarterly. For code agents, use recently-filed GitHub issues rather than old ones likely in training data. For tool-use agents, generate synthetic scenarios with randomized parameters. If an agent suddenly aces previously-hard cases without code changes, suspect contamination—not improvement.
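The synthetic-scenario suggestion for tool-use agents can be sketched as below. This is a minimal illustration, not a prescribed implementation: the `Scenario` type, the `book_flight` tool name, the city list, and the `generate_suite` helper are all hypothetical. Seeding the generator per rotation (for example, one seed per quarter) keeps each suite reproducible while still changing the concrete parameters between rotations.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    prompt: str
    expected_tool: str    # hypothetical tool name the agent should call
    expected_args: dict   # arguments the agent is expected to pass


# Hypothetical parameter pool; a real suite would draw from a larger one.
CITIES = ["Lisbon", "Osaka", "Nairobi", "Denver", "Tallinn"]


def make_booking_scenario(rng: random.Random, index: int) -> Scenario:
    """Build one flight-booking scenario with randomized parameters."""
    origin, destination = rng.sample(CITIES, 2)  # two distinct cities
    passengers = rng.randint(1, 6)
    date = f"2030-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
    prompt = (
        f"Book a flight for {passengers} passenger(s) "
        f"from {origin} to {destination} on {date}."
    )
    return Scenario(
        scenario_id=f"booking-{index:04d}",
        prompt=prompt,
        expected_tool="book_flight",
        expected_args={
            "origin": origin,
            "destination": destination,
            "date": date,
            "passengers": passengers,
        },
    )


def generate_suite(seed: int, n: int) -> list[Scenario]:
    """Generate n scenarios; the same seed always yields the same suite."""
    rng = random.Random(seed)
    return [make_booking_scenario(rng, i) for i in range(n)]
```

Calling `generate_suite(seed, n)` with a new seed each rotation gives a fresh suite, and recording the seed alongside the results lets any past suite be regenerated for auditing.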
Journey Context:
SWE-bench faced this directly: models improved on the benchmark, but some of the gains were attributed to data contamination (the underlying GitHub issues were likely in training data). SWE-bench Verified, a human-validated subset, filters for solvable, well-specified tasks, but human filtering alone does not remove contamination. The same risk applies to any static eval dataset. The fix isn't just 'use new data'; it's systematic rotation and contamination auditing. Maintain a pipeline that generates fresh test cases and retires old ones. For your own evals, version the dataset and track per-scenario performance over time to spot suspicious improvements.
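Per-scenario tracking can be sketched as a comparison between two eval runs. The `RunResult` record, the `flag_suspicious_jumps` helper, and the 0.5 threshold are assumptions made for illustration; the idea is only that a large pass-rate jump on a scenario, with the agent code unchanged, is a signal to audit for contamination rather than to celebrate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    scenario_id: str
    agent_version: str  # e.g. a commit hash of the agent code (assumed field)
    pass_rate: float    # fraction of trials passed, between 0.0 and 1.0


def flag_suspicious_jumps(
    previous: list[RunResult],
    current: list[RunResult],
    min_jump: float = 0.5,  # illustrative threshold, tune per eval
) -> list[str]:
    """Return IDs of scenarios whose pass rate rose by at least min_jump
    while the agent code version stayed the same."""
    prev_by_id = {r.scenario_id: r for r in previous}
    flagged = []
    for cur in current:
        prev = prev_by_id.get(cur.scenario_id)
        if prev is None:
            continue  # new scenario, no history to compare against
        same_code = prev.agent_version == cur.agent_version
        if same_code and cur.pass_rate - prev.pass_rate >= min_jump:
            flagged.append(cur.scenario_id)
    return flagged
```

Flagged scenarios are candidates for retirement or manual review. Storing the dataset version next to each run keeps these comparisons restricted to scenarios that were identical in both runs.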
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:10:25.750420+00:00— report_created — created