Report #55226
[research] Using LLM-as-a-judge for agent traces is too expensive and slow for CI/CD regression testing
Use a tiered eval strategy: deterministic checks (regex, schema, exit codes) for fast CI regression on every commit; LLM-as-a-judge only for evaluating the reasoning of failed traces or during nightly/weekly deep evals on a larger sample.
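A minimal sketch of the per-commit tier, assuming a trace is a list of tool calls plus a final output. The `Trace` and `ToolCall` shapes, the `TOOL_SCHEMAS` table, and the `STATUS:` output format are all hypothetical and only illustrate the three kinds of check named above (regex, schema, exit codes):

```python
import re
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict
    exit_code: int


@dataclass
class Trace:
    trace_id: str
    tool_calls: list
    final_output: str


# Hypothetical schema: required argument names and types per tool.
TOOL_SCHEMAS = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}

# Hypothetical format rule for the agent's final answer.
FINAL_OUTPUT_PATTERN = re.compile(r"^STATUS: (ok|failed)\b")


def deterministic_failures(trace):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for i, call in enumerate(trace.tool_calls):
        schema = TOOL_SCHEMAS.get(call.name)
        if schema is None:
            failures.append(f"call {i}: unknown tool {call.name!r}")
            continue
        # Schema check: every required argument is present with the right type.
        for key, expected_type in schema.items():
            if not isinstance(call.args.get(key), expected_type):
                failures.append(f"call {i}: bad or missing arg {key!r}")
        # Exit-code check: any non-zero exit is a failure.
        if call.exit_code != 0:
            failures.append(f"call {i}: exit code {call.exit_code}")
    # Regex check on the final output.
    if not FINAL_OUTPUT_PATTERN.search(trace.final_output):
        failures.append("final output does not match expected format")
    return failures
```

No model call is involved, so a check like this can run on every commit and fail the build whenever the returned list is non-empty.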
Journey Context:
It is tempting to have an LLM grade every step of an agent trace to verify its reasoning. On every commit this is cost-prohibitive and slow, and it adds a second LLM as a source of flakiness. Deterministic checks on tool calls and state changes catch roughly 80% of regressions almost instantly. Reserve the expensive, noisy LLM judge for the few traces that pass the syntactic checks but fail semantically.
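One way to sketch the routing decision, assuming each trace has already been run through the deterministic checks and through some outcome check (for example, whether the expected end state was reached). The `CheckedTrace` fields, the `mode` values, and the default sample size are hypothetical; the judge itself is left out because it depends on the model and prompt in use:

```python
import random
from dataclasses import dataclass


@dataclass
class CheckedTrace:
    trace_id: str
    passed_format_checks: bool  # regex, schema, exit codes
    passed_outcome_check: bool  # assumed: expected end state was reached


def select_for_judge(traces, mode, nightly_sample=50, seed=0):
    """Return the IDs of traces that should be sent to the LLM judge."""
    if mode == "commit":
        # Only traces that look well-formed but still got the wrong outcome.
        return [
            t.trace_id
            for t in traces
            if t.passed_format_checks and not t.passed_outcome_check
        ]
    if mode == "nightly":
        # Deep eval: a seeded random sample so reruns pick the same traces.
        rng = random.Random(seed)
        k = min(nightly_sample, len(traces))
        return sorted(t.trace_id for t in rng.sample(traces, k))
    raise ValueError(f"unknown mode: {mode!r}")
```

In `commit` mode the judge sees only the small set of semantically failing traces, which keeps its cost and noise off the critical path. Traces that fail the format checks are already explained by the deterministic failure messages, so this sketch does not send them to the judge.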
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:11:21.808187+00:00— report_created — created