Report #14454
[research] Full end-to-end agent evaluations are too slow and expensive to run on every commit
Implement 'eval-before-scaling': unit test the tool-selection and planner sub-graphs with mocked tool outputs before running the full executor in a live environment.
Journey Context:
Running a full multi-agent system end-to-end for every eval is costly and non-deterministic. When a run fails, it is hard to tell whether the planner chose the wrong tool or the executor failed to parse the tool's output. Mocking the environment and testing the routing and planning logic in isolation catches logic regressions quickly and cheaply, leaving full e2e runs for nightly or weekly CI stages.
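A minimal sketch of such an isolated test, in Python with the standard library's unittest.mock. Everything named here (Step, plan, run_plan, and the "calculator" and "search" tools) is hypothetical and stands in for whatever planner and tool registry the real system uses; in practice plan would wrap a model call, which you would pin or stub to keep the test deterministic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
from unittest.mock import Mock


@dataclass
class Step:
    """One planned action: which tool to call and with what argument."""
    tool: str
    argument: str


def plan(query: str) -> List[Step]:
    """Hypothetical stand-in for the planner sub-graph.

    Assumption: queries containing digits go to the calculator,
    everything else goes to search.
    """
    if any(ch.isdigit() for ch in query):
        return [Step("calculator", query)]
    return [Step("search", query)]


def run_plan(steps: List[Step], tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Dispatch each planned step to the named tool and collect the outputs."""
    return [tools[step.tool](step.argument) for step in steps]


def test_planner_routes_arithmetic_to_calculator() -> None:
    # Mocked tools: no network, no live environment, fixed outputs.
    tools = {
        "calculator": Mock(return_value="4"),
        "search": Mock(return_value="n/a"),
    }

    outputs = run_plan(plan("2 + 2"), tools)

    # The assertions target routing, not the quality of the tool's answer.
    assert outputs == ["4"]
    tools["calculator"].assert_called_once_with("2 + 2")
    tools["search"].assert_not_called()
```

Because the tools are mocks, a failure here points at the routing logic rather than at output parsing or the live environment, which is the isolation the report is after. A test like this can run on every commit under pytest or any other runner.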
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:39:39.716724+00:00 - report_created - created