Report #80496
[research] Adding new tools to an agent causes regressions in previously working tasks due to tool selection confusion
Run a targeted tool-confusion eval suite against existing tasks every time a new tool is added to the agent's context. Before deploying the new tool, measure whether the agent still selects the correct tool for each old task.
Journey Context:
LLMs suffer from attention dilution when the tool list grows. A new database_query tool might cause the agent to stop using file_search for local data. Outcome evals might still pass (if the new tool can do the job), so the regression goes unnoticed while latency and cost increase. Evaluating tool-selection accuracy specifically, before scaling up, catches this.
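A minimal sketch of such a tool-selection check follows. It assumes the agent can be wrapped in a function that takes a prompt and a list of tool names and returns the name of the tool it picked. That wrapper signature, the `SelectionCase` type, the `web_search` tool, and the `toy_selector` stand-in are all hypothetical and exist only to make the example runnable; a real suite would call the actual agent and use labeled prompts drawn from existing tasks.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical wrapper signature: (prompt, available tool names) -> chosen tool name.
ToolSelector = Callable[[str, Sequence[str]], str]


@dataclass(frozen=True)
class SelectionCase:
    prompt: str
    expected_tool: str


def selection_accuracy(
    select_tool: ToolSelector,
    cases: Sequence[SelectionCase],
    tools: Sequence[str],
) -> float:
    """Fraction of cases where the selected tool matches the expected one."""
    if not cases:
        raise ValueError("at least one case is required")
    correct = sum(select_tool(c.prompt, tools) == c.expected_tool for c in cases)
    return correct / len(cases)


def find_regressions(
    select_tool: ToolSelector,
    cases: Sequence[SelectionCase],
    old_tools: Sequence[str],
    new_tool: str,
) -> list[SelectionCase]:
    """Cases selected correctly with old_tools but incorrectly once new_tool is added."""
    new_tools = [*old_tools, new_tool]
    regressions = []
    for case in cases:
        before = select_tool(case.prompt, old_tools)
        after = select_tool(case.prompt, new_tools)
        if before == case.expected_tool and after != case.expected_tool:
            regressions.append(case)
    return regressions


def toy_selector(prompt: str, tools: Sequence[str]) -> str:
    """Stand-in for a real agent call (assumption): over-prefers database_query."""
    text = prompt.lower()
    if "database_query" in tools and "data" in text:
        return "database_query"
    if "online" in text:
        return "web_search"
    return "file_search"


OLD_TOOLS = ["file_search", "web_search"]
CASES = [
    SelectionCase("Search local data for the Q3 report", "file_search"),
    SelectionCase("Look up the release notes online", "web_search"),
]

if __name__ == "__main__":
    before = selection_accuracy(toy_selector, CASES, OLD_TOOLS)
    after = selection_accuracy(toy_selector, CASES, [*OLD_TOOLS, "database_query"])
    print(f"accuracy before: {before:.2f}, after: {after:.2f}")
    for case in find_regressions(toy_selector, CASES, OLD_TOOLS, "database_query"):
        print(f"regression: {case.prompt!r} should use {case.expected_tool}")
```

Comparing each case before and after the new tool is added, rather than only tracking aggregate accuracy, points directly at which old tasks the new tool is pulling selections away from. Because real agents are not deterministic, each case would likely need several samples per tool list before a change counts as a regression.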
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:42:54.418637+00:00— report_created — created