Report #62758
[research] LLM-as-a-judge evals are too expensive or slow to run on every intermediate step
Sample intermediate steps for LLM-as-a-judge evaluation based on anomaly detection (e.g., high step count, unexpected tool calls), and use deterministic heuristics (regex, schema validation) for the majority of steps.
Journey Context:
Running a powerful LLM to evaluate every span in a complex agent trace is slow and cost-prohibitive. Most intermediate steps (like formatting a standard API call) don't need an LLM to evaluate them. The recommended pattern is a tiered observability pipeline: run cheap, deterministic checks (schema validation, exact match) on all steps, and route only anomalous traces (e.g., traces that hit the max step limit, or where a tool threw an exception) to the expensive LLM-as-a-judge for deep reasoning evaluation.
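The tiered pipeline described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Step` and `Trace` shapes, the `MAX_STEPS` limit, the `EXPECTED_TOOLS` allowlist, and all function names are hypothetical and would need to be mapped onto whatever tracing format is actually in use. The LLM judge call itself is left out; the sketch only decides which traces would be sent to it.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Dict, List

MAX_STEPS = 20  # assumed agent step limit (hypothetical value)
EXPECTED_TOOLS = {"search", "fetch_url", "calculator"}  # hypothetical allowlist
TOOL_NAME_RE = re.compile(r"[a-z_][a-z0-9_]*")


@dataclass
class Step:
    tool: str
    arguments: str  # raw JSON string emitted by the agent
    raised_exception: bool = False


@dataclass
class Trace:
    trace_id: str
    steps: List[Step] = field(default_factory=list)


def cheap_checks(step: Step) -> List[str]:
    """Tier 1: deterministic checks that run on every step."""
    failures = []
    try:
        if not isinstance(json.loads(step.arguments), dict):
            failures.append("arguments_not_object")
    except json.JSONDecodeError:
        failures.append("invalid_json")
    if not TOOL_NAME_RE.fullmatch(step.tool):
        failures.append("malformed_tool_name")
    return failures


def anomaly_reasons(trace: Trace) -> List[str]:
    """Tier 2 gate: trace-level signals that justify an LLM judge."""
    reasons = []
    if len(trace.steps) >= MAX_STEPS:
        reasons.append("hit_max_steps")
    if any(s.raised_exception for s in trace.steps):
        reasons.append("tool_exception")
    if any(s.tool not in EXPECTED_TOOLS for s in trace.steps):
        reasons.append("unexpected_tool")
    return reasons


def triage(trace: Trace) -> Dict[str, object]:
    """Run cheap checks on all steps and flag anomalous traces for the judge."""
    step_failures = {}
    for index, step in enumerate(trace.steps):
        failures = cheap_checks(step)
        if failures:
            step_failures[index] = failures
    reasons = anomaly_reasons(trace)
    return {
        "trace_id": trace.trace_id,
        "step_failures": step_failures,
        "anomalies": reasons,
        "send_to_llm_judge": bool(reasons),
    }
```

In this sketch only the trace-level anomaly signals trigger the judge; per-step check failures are recorded but do not escalate by themselves. Whether they should is a design choice that depends on how noisy the deterministic checks turn out to be.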
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:49:22.923060+00:00— report_created — created