Report #3349
[research] Running full end-to-end agent regression suites on every prompt change is too slow and expensive for fast iteration
Split evals into a fast unit eval suite (testing tool selection and argument generation in isolation) and a slow integration eval suite (full multi-step trajectories). Run unit evals on every commit; run integration evals only on merges or on a schedule.
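A minimal sketch of that gating rule, assuming a CI wrapper that knows which event triggered the run. The `Trigger` enum, the suite names, and `suites_for` are hypothetical names for illustration, not part of any CI system. Also running the unit suite on merges and scheduled runs is an assumption; the report only says integration evals are restricted to those events.

```python
from enum import Enum


class Trigger(Enum):
    """CI events that can start an eval run (hypothetical names)."""
    COMMIT = "commit"
    MERGE = "merge"
    SCHEDULED = "scheduled"


def suites_for(trigger: Trigger) -> list[str]:
    """Return the eval suites to run for a given CI trigger."""
    if trigger is Trigger.COMMIT:
        # Every commit gets only the fast, cheap suite.
        return ["unit"]
    # Merges and scheduled runs also pay for full trajectories.
    # Assumption: the unit suite is cheap enough to run here as well.
    return ["unit", "integration"]


if __name__ == "__main__":
    for trigger in Trigger:
        print(trigger.value, suites_for(trigger))
```

Keeping the rule in one function makes it easy to adjust the cadence later without touching the eval code itself.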
Journey Context:
Agent trajectories are non-deterministic and expensive (multiple LLM calls per run), so running 100 full-trajectory evals per PR blocks development. However, most regressions come from prompt formatting or tool selection, and these can be tested with a single LLM call whose output is compared against the expected tool call and arguments. Isolating the intent/tool-selection eval from the execution eval gives roughly 80% of the signal at about 10% of the cost and latency; treat these figures as a rough estimate, not a measurement.
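The single-call unit eval described above could look roughly like the following sketch. All names here (`ToolCall`, `UnitCase`, `run_unit_evals`, `fake_select_tool`, and the example tools) are hypothetical. `fake_select_tool` is a deterministic stand-in for the one LLM call, so the example runs without a model; in practice it would be replaced by a function that sends the prompt and tool schemas to the model and parses the returned tool call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCall:
    """A tool name plus the arguments the model generated for it."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class UnitCase:
    """One unit eval case: a user prompt and the tool call we expect."""
    prompt: str
    expected: ToolCall


def run_unit_evals(
    select_tool: Callable[[str], ToolCall], cases: list[UnitCase]
) -> float:
    """Return the fraction of cases where one call produced the expected tool call."""
    if not cases:
        raise ValueError("no eval cases supplied")
    # Dataclass equality compares both the tool name and the arguments.
    passed = sum(1 for case in cases if select_tool(case.prompt) == case.expected)
    return passed / len(cases)


def fake_select_tool(prompt: str) -> ToolCall:
    """Stand-in for a single LLM call (assumption: replace with a real client)."""
    if "weather" in prompt.lower():
        # Deliberately naive: always answers for Paris, so one case below fails.
        return ToolCall("get_weather", {"city": "Paris"})
    return ToolCall("search", {"query": prompt})


CASES = [
    UnitCase("What is the weather in Paris?", ToolCall("get_weather", {"city": "Paris"})),
    UnitCase("Find docs on retries", ToolCall("search", {"query": "Find docs on retries"})),
    UnitCase("Weather in Oslo?", ToolCall("get_weather", {"city": "Oslo"})),
]

if __name__ == "__main__":
    print(f"pass rate: {run_unit_evals(fake_select_tool, CASES):.2f}")
```

Because real model output is non-deterministic, a pass-rate threshold is usually a safer gate than requiring every case to pass; exact-match comparison of arguments may also be too strict for free-text fields.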
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:34:36.674697+00:00— report_created — created