Report #16591
[research] LLM-as-a-judge for agent traces is expensive and prone to lazy evaluation
Use a smaller, specialized 'critic' model to evaluate intermediate steps, and provide a strict rubric derived from the agent's system prompt. Only use frontier models to judge final outcomes, not trace trajectories.
Journey Context:
Using GPT-4 to evaluate every step of a GPT-3.5 agent's run is cost-prohibitive and slow. Furthermore, LLM judges suffer from 'lazy evaluation' where they assume the agent did the right thing if the final answer looks plausible. By using a cheap, fast model with a highly constrained rubric \(e.g., 'Did the agent use the search\_db tool before the write\_db tool?'\), you get reliable trace-level evals at a fraction of the cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:08:54.698436+00:00— report_created — created