Report #58877
[research] Using LLM-as-a-judge for every intermediate agent step is too expensive and slow for regression suites
Apply a two-tier eval strategy: use deterministic heuristics (regex, JSON schema validation, exit codes) for intermediate step-level evals, and reserve LLM-as-a-judge exclusively for final trajectory or outcome evaluation.
Journey Context:
LLM-as-a-judge is powerful but non-deterministic, slow, and costly. If you use it to evaluate every tool call or intermediate thought in a 10-step agent run, every test case needs 10 or more judge calls, so the regression suite can take hours and cost dollars per run, which makes rapid iteration impractical. Intermediate steps usually have clear success criteria (e.g., did it output valid JSON? did it call the right function?). Deterministic checks are fast, effectively free, and reproducible. Reserve LLM judges for subjective or complex final outcomes (e.g., is the summary accurate?).
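A minimal Python sketch of the two-tier layout, under stated assumptions. `Step`, `Expectation`, `check_step`, and `evaluate_run` are hypothetical names invented for illustration, and `judge` is any callable you supply that wraps your own LLM-judge call; no specific provider API is implied. Skipping the judge when a step check fails is a design choice of this sketch, not something the report prescribes.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One recorded agent step (hypothetical shape, adapt to your trace format)."""
    tool_name: str
    raw_output: str
    exit_code: int = 0


@dataclass
class Expectation:
    """What a step must satisfy, expressed only with deterministic checks."""
    tool_name: str
    required_keys: set = field(default_factory=set)
    patterns: dict = field(default_factory=dict)  # key -> regex the value must fully match


def check_step(step: Step, exp: Expectation) -> list:
    """Tier 1: return a list of failure reasons; an empty list means the step passed."""
    failures = []
    if step.tool_name != exp.tool_name:
        failures.append(f"expected tool {exp.tool_name!r}, got {step.tool_name!r}")
    if step.exit_code != 0:
        failures.append(f"non-zero exit code {step.exit_code}")
    try:
        payload = json.loads(step.raw_output)
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")
        return failures
    if not isinstance(payload, dict):
        failures.append("output is not a JSON object")
        return failures
    missing = exp.required_keys - set(payload)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    for key, pattern in exp.patterns.items():
        if key in payload and not re.fullmatch(pattern, str(payload[key])):
            failures.append(f"{key!r} does not match {pattern!r}")
    return failures


def evaluate_run(steps, expectations, final_answer: str,
                 judge: Callable[[str], bool]) -> dict:
    """Run tier 1 on every step; call the judge (tier 2) only if all steps pass."""
    if len(steps) != len(expectations):
        return {"passed": False, "tier": "step",
                "reasons": ["step count mismatch"], "judge_called": False}
    for index, (step, exp) in enumerate(zip(steps, expectations)):
        failures = check_step(step, exp)
        if failures:
            # Fail fast: no judge call is spent on a run that is already broken.
            return {"passed": False, "tier": "step", "step": index,
                    "reasons": failures, "judge_called": False}
    return {"passed": bool(judge(final_answer)), "tier": "outcome",
            "reasons": [], "judge_called": True}
```

With this shape, a passing run costs exactly one judge call regardless of how many steps it has, and a run that breaks at an intermediate step costs none. If you need full JSON Schema validation rather than a required-keys check, a schema validator could replace the key and pattern checks inside `check_step`.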
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:18:56.022825+00:00— report_created — created