Report #42891
[research] Agent evaluation runs are too expensive and slow to run on every commit
Decouple LLM reasoning evals from tool execution evals. Mock the tool calls (return pre-recorded fixtures) to test LLM routing and argument generation cheaply, then run a smaller subset of end-to-end integration tests against the live tools.
Journey Context:
Running the full end-to-end agent loop for every eval is prohibitively expensive and slow. Mocking the environment isolates the LLM's decision-making (did it choose the right tool and the right arguments?), which is the most fragile part, and takes the correctness of the external APIs as given. This allows fast, cheap regression testing of agent behavior.
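A minimal sketch of the mocked half of this split, assuming a single-step agent that maps a prompt to one tool call. Every name here (`ToolCall`, `FIXTURES`, `mock_execute`, `stub_choose_tool`, `run_routing_eval`) and the fixture contents are hypothetical and not taken from the report; `stub_choose_tool` stands in for the real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class EvalCase:
    prompt: str
    expected: ToolCall


# Pre-recorded tool responses (made-up data), keyed by tool name.
FIXTURES = {
    "get_weather": {"temp_c": 18, "conditions": "cloudy"},
    "search_docs": {"hits": ["doc-1", "doc-2"]},
}


def mock_execute(call: ToolCall) -> Any:
    """Return the recorded fixture instead of calling the live tool."""
    return FIXTURES[call.name]


def stub_choose_tool(prompt: str) -> ToolCall:
    """Placeholder for the LLM: swap in the real model call here."""
    if "weather" in prompt.lower():
        return ToolCall("get_weather", {"city": "Paris"})
    return ToolCall("search_docs", {"query": prompt})


def run_routing_eval(
    choose_tool: Callable[[str], ToolCall], cases: list
) -> list:
    """Grade only the decision: same tool name and same arguments."""
    results = []
    for case in cases:
        call = choose_tool(case.prompt)
        results.append(call == case.expected)
    return results


CASES = [
    EvalCase("What's the weather in Paris?",
             ToolCall("get_weather", {"city": "Paris"})),
    EvalCase("refund policy",
             ToolCall("search_docs", {"query": "refund policy"})),
    # The stub always answers "Paris", so this case is expected to fail.
    EvalCase("Weather in Berlin?",
             ToolCall("get_weather", {"city": "Berlin"})),
]

print(run_routing_eval(stub_choose_tool, CASES))  # prints [True, True, False]
```

In a multi-step loop, the value returned by `mock_execute` would be fed back to the model as the tool observation, so no live API is touched. The smaller end-to-end subset would run the same cases with a real executor in place of `mock_execute`.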
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:27:40.617997+00:00— report_created — created