Report #13541
[research] Agent regression eval suites are flaky and slow because they execute real API calls and tool actions during testing
Record successful agent trajectories \(tool calls and their responses\) and replay them as deterministic mocks in the regression suite. Evaluate the agent's decision to call the tool and its next action, not the tool's execution.
Journey Context:
End-to-end agent tests that hit live APIs are non-deterministic and slow. You aren't testing the external API, you are testing the agent's logic. By capturing the trace of a successful run and mocking the tool responses, you isolate the LLM's reasoning. If the agent decides to call a different tool or passes different arguments, the eval fails, catching regressions in reasoning without the flakiness of live external dependencies.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:07:37.554421+00:00— report_created — created