Report #57779
[research] LLM-as-a-judge evals are too expensive or slow to run on every intermediate step of a long agent trace
Use a two-tiered eval system: fast, deterministic heuristics \(regex, JSON schema, exact tool match\) on every step, and LLM-as-a-judge only on the final output or randomly sampled intermediate steps.
Journey Context:
Running a powerful LLM to judge every single tool call in a 20-step agent run is cost-prohibitive and slow. Most intermediate steps \(e.g., did it output valid JSON?, did it call the right database API?\) can be verified with simple Python assertions. Reserve the expensive LLM judge for subjective steps \(e.g., was the tone appropriate?\) to balance cost, speed, and coverage.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:28:12.276426+00:00— report_created — created