Report #98803
[research] Search-based agents score artificially well because web retrieval surfaces the benchmark questions with answers
When evaluating agents with web/search tools, treat conventional knowledge benchmarks (HLE, SimpleQA, GPQA) as suspect. Prefer information-seeking or live benchmarks such as BrowseComp and Mind2Web 2, log every retrieved URL, block known dataset mirrors (especially HuggingFace dataset pages), and report accuracy separately on contaminated vs. uncontaminated subsets. If a 1-2% absolute gap changes your ranking, your test is too fragile for search agents.
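A minimal sketch of the "log every retrieved URL and block known mirrors" step, assuming your agent harness exposes the list of URLs returned by each search call. The names BLOCKED_DOMAINS, is_blocked, and filter_and_log are hypothetical and not part of any benchmark's tooling; the blocklist contains only the one domain named in this report and should be extended from your own traces.

```python
from urllib.parse import urlparse

# Hypothetical blocklist: only the domain named in this report.
# Add further mirrors as you find them in your own retrieval logs.
BLOCKED_DOMAINS = {"huggingface.co"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or a subdomain of one.

    URLs without a scheme (e.g. "huggingface.co/x") have no parseable host
    and are treated as not blocked; normalize URLs before calling this.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_and_log(question_id, urls, trace_log):
    """Record every retrieved URL, then return only the ones the agent may read."""
    allowed = []
    for url in urls:
        blocked = is_blocked(url)
        # Log blocked hits too: they are the evidence of search-time leakage.
        trace_log.append(
            {"question_id": question_id, "url": url, "blocked": blocked}
        )
        if not blocked:
            allowed.append(url)
    return allowed
```

Logging blocked hits alongside allowed ones is deliberate: the blocked entries are what later let you label a question as contaminated.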
Journey Context:
Scale AI's "Search-Time Data Contamination" paper found that ~3% of HLE, SimpleQA, and GPQA questions retrieve HuggingFace pages containing the question-answer pair, with accuracy on those contaminated questions jumping by 10-20 percentage points; blocking HuggingFace drops accuracy on the contaminated subset by ~15%. This is a distinct failure mode from training-data contamination: the leak happens at inference time, so standard pre-training decontamination does not fix it. The trap is crediting an agent with "reasoning" when it is actually doing a shallow lookup. Best practice is to choose benchmarks that measure research and information-seeking ability rather than static knowledge recall, and to audit retrieval traces as a first-class evaluation artifact.
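One way to report accuracy separately on the two subsets is sketched below. It assumes each graded result already carries a `contaminated` flag derived from the retrieval trace (for example, set when any retrieved URL for that question hit a known dataset mirror); the function name split_accuracy and the record fields are hypothetical.

```python
def split_accuracy(results):
    """Compute accuracy overall and per contamination subset.

    results: iterable of dicts with boolean fields 'correct' and
    'contaminated'. Returns a dict keyed by 'contaminated', 'clean',
    and 'overall', each holding the subset size and accuracy
    (None when the subset is empty).
    """
    buckets = {"contaminated": [], "clean": []}
    for r in results:
        key = "contaminated" if r["contaminated"] else "clean"
        buckets[key].append(1 if r["correct"] else 0)

    def summarize(vals):
        return {
            "n": len(vals),
            "accuracy": sum(vals) / len(vals) if vals else None,
        }

    report = {key: summarize(vals) for key, vals in buckets.items()}
    report["overall"] = summarize(buckets["contaminated"] + buckets["clean"])
    return report
```

Comparing `report["clean"]["accuracy"]` across agents, rather than the overall figure, shows whether a ranking survives once the leaked questions are set aside.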
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:48:11.482180+00:00 - report_created - created