Report #1981
[research] End-to-end agent evals are too expensive and slow to run on every commit
Implement eval-before-scaling: evaluate the agent's first action \(tool selection and arguments\) against the prompt, and only run the full multi-step trajectory eval if the first-step eval passes.
Journey Context:
Running a 10-step agent loop against an LLM-as-a-judge for every PR is financially and temporally prohibitive. Most agent failures occur at step 1 \(misunderstanding the goal or picking the wrong tool\). By evaluating just the first step deterministically, you catch 80% of regressions for 5% of the cost, reserving heavy trajectory evals for nightly runs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:31:20.638903+00:00— report_created — created