Report #27259
[research] LLM-as-a-judge for every intermediate agent step is too slow and expensive for CI/CD regression suites
Use LLM-as-a-judge only for final output evaluation; use fast, deterministic heuristics (regex, JSON schema, exact tool name matching) for intermediate step trajectory evals in CI.
Journey Context:
Developers often try to use an LLM to grade every single step of an agent's trajectory. This can make regression suites take hours, drives up API cost, and introduces non-determinism into the CI pipeline. The better tradeoff is a hybrid approach: deterministic checks for the scaffolding (did it call the right tool? did it pass valid JSON?) and LLM-as-a-judge only for the final, complex synthesis output.
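As a rough illustration of the deterministic half of that hybrid, the sketch below checks a recorded trajectory using only the Python standard library. Everything specific in it is an assumption made for the example, not something from this report: the step format (a dict with `tool` and a JSON-string `arguments`), the tool names `search_docs` and `fetch_page`, the required-argument table, and the helper name `check_trajectory`. The schema check is hand-rolled and minimal; a real suite might use a full JSON Schema validator instead.

```python
import json
import re

# Hypothetical expectations for one regression case (assumed names, not from the report).
EXPECTED_TOOLS = ["search_docs", "fetch_page"]
REQUIRED_ARGS = {"search_docs": {"query": str}, "fetch_page": {"url": str}}
URL_PATTERN = re.compile(r"^https?://\S+$")


def check_trajectory(steps):
    """Return a list of failure messages; an empty list means the trajectory passed."""
    failures = []

    # Exact tool name matching: the ordered sequence of calls must match.
    called = [step.get("tool") for step in steps]
    if called != EXPECTED_TOOLS:
        failures.append(f"tool sequence {called} != {EXPECTED_TOOLS}")

    for i, step in enumerate(steps):
        # Arguments must parse as a JSON object.
        try:
            args = json.loads(step.get("arguments", ""))
        except (json.JSONDecodeError, TypeError):
            failures.append(f"step {i}: arguments are not valid JSON")
            continue
        if not isinstance(args, dict):
            failures.append(f"step {i}: arguments are not a JSON object")
            continue

        # Minimal schema check: required keys present with the expected type.
        for key, expected_type in REQUIRED_ARGS.get(step.get("tool"), {}).items():
            if not isinstance(args.get(key), expected_type):
                failures.append(f"step {i}: '{key}' missing or not {expected_type.__name__}")

        # Regex check on a field with a known shape.
        url = args.get("url")
        if isinstance(url, str) and not URL_PATTERN.match(url):
            failures.append(f"step {i}: malformed url {url!r}")

    return failures
```

In a CI job, one option is to run the LLM judge on the final output only when `check_trajectory` returns an empty list, so a broken tool call fails the case without spending a judge call.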
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:09:07.458434+00:00— report_created — created