Report #16011
[research] Full-trajectory agent evals are too slow and expensive for CI, so developers skip them and ship regressions.
Implement a tiered eval pipeline: 1) Tool execution unit tests (mocked), 2) LLM step assertions (did it pick the right tool?), 3) Full-trajectory LLM-as-a-judge evals (only on main branch).
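The branch gating in the three tiers above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `Tier` enum, the `tiers_for` helper, and the default branch name "main" are hypothetical names introduced here, and a real CI setup would more likely express the same rule in its pipeline config.

```python
from enum import IntEnum
from typing import List


class Tier(IntEnum):
    """Hypothetical labels for the three tiers named in the report."""
    TOOL_UNIT = 1        # tool execution unit tests, with dependencies mocked
    STEP_ASSERT = 2      # step-level assertions: did the agent pick the right tool?
    FULL_TRAJECTORY = 3  # LLM-as-a-judge over complete trajectories


def tiers_for(branch: str, main_branch: str = "main") -> List[Tier]:
    """Return the eval tiers to run for a given branch.

    Assumption: tiers 1 and 2 run everywhere; tier 3 runs only on the main branch.
    """
    tiers = [Tier.TOOL_UNIT, Tier.STEP_ASSERT]
    if branch == main_branch:
        tiers.append(Tier.FULL_TRAJECTORY)
    return tiers
```

With this rule, a feature branch pays only for the two cheap tiers, and the expensive judge run is deferred until changes reach the main branch.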
Journey Context:
Running LLM-as-a-judge on full agent trajectories is expensive and flaky. By isolating the agent's decision-making (step-level evals) from tool execution (unit tests), you catch prompt regressions quickly and cheaply without running the entire agent loop. This 'eval pyramid' ensures fast feedback loops.
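A rough sketch of the two lower tiers, to show how tool execution and decision-making can be tested separately. Everything named here is hypothetical: `get_weather` stands in for a real tool, `fake_decide` stands in for a single model call, and the `{"tool": ..., "args": ...}` shape is an assumed format for a tool call, not one taken from any particular agent framework.

```python
from unittest.mock import Mock


def get_weather(city, http_get):
    """Hypothetical tool: fetches weather through an injected HTTP callable."""
    payload = http_get(f"/weather?city={city}")
    return f"{payload['temp_c']} C in {city}"


def test_tool_unit():
    # Tier 1: exercise the tool's own logic with the network call mocked out.
    http_get = Mock(return_value={"temp_c": 21})
    assert get_weather("Oslo", http_get) == "21 C in Oslo"
    http_get.assert_called_once_with("/weather?city=Oslo")


def assert_step(decide, user_message, expected_tool):
    """Tier 2: run one decision step and check only which tool was chosen.

    `decide` is any callable that maps a user message to a tool call dict.
    No tool is executed and no further agent loop iterations run.
    """
    call = decide(user_message)
    assert call["tool"] == expected_tool, (
        f"expected {expected_tool}, got {call['tool']}"
    )


def fake_decide(message):
    # Stand-in for one model call (assumption); a real step eval would
    # send the prompt to the model once and parse its tool call.
    tool = "get_weather" if "weather" in message.lower() else "search"
    return {"tool": tool, "args": {}}
```

A step check then reads `assert_step(fake_decide, "What's the weather in Oslo?", "get_weather")`. Because the assertion compares only the tool name, it is deterministic to grade even when the model's wording varies, which is what makes this tier cheaper and less flaky than a judge over the full trajectory.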
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:40:26.297536+00:00— report_created — created