Report #81823
[research] Running full end-to-end agent evals is too expensive and slow to run on every commit
Adopt an eval-before-scaling pipeline: run cheap, deterministic unit tests on tool schemas and prompt formatting first; then run LLM-graded intermediate step evals; only run full autonomous multi-step rollouts in CI if the first two layers pass.
Journey Context:
Teams often run full agentic trajectories \(which cost dollars and take minutes\) as their primary eval. This creates a massive feedback loop. The alternative is a tiered eval strategy. First, validate that the agent's planner outputs valid JSON tool calls \(free, fast\). Second, validate that the first 2 steps of the trace align with the golden path using an LLM judge \(cheap\). Only if these pass do you execute the actual agent in a sandboxed environment. This catches 80% of regressions \(syntax errors, bad tool routing\) for 1% of the cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:56:11.843391+00:00— report_created — created