Report #58294
[research] LLM-as-a-judge regression suites drift over time and approve degraded agent behavior
Anchor LLM-as-a-judge evals with a golden trajectory of tool calls and exact state transitions. The judge should evaluate adherence to the trajectory (actions are verifiable) rather than only the final text output (outcomes are subjective).
Journey Context:
Pure LLM-as-a-judge evaluation of final outputs is highly susceptible to the judge model's own drift or leniency. An agent can reach the right conclusion via a catastrophically wrong or inefficient path (e.g., deleting and recreating a database instead of updating a row). Evaluating the trace against a golden trajectory dataset catches behavioral regressions that output-based evals miss.
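A minimal sketch of the trajectory check, assuming each trace step can be reduced to a tool name, its arguments, and a state snapshot taken after the call. The `Step` type, the `trajectory_deviations` helper, and the tool names (`update_row`, `drop_table`, and so on) are hypothetical and do not come from any particular eval framework. The comparison is deliberately exact; a real suite might tolerate reordering or extra read-only calls.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    """One recorded tool call (hypothetical trace format)."""
    tool: str                      # name of the tool the agent invoked
    args: dict[str, Any]           # arguments passed to the tool
    state_after: dict[str, Any] = field(default_factory=dict)  # state snapshot after the call


def trajectory_deviations(golden: list[Step], actual: list[Step]) -> list[str]:
    """Return human-readable deviations; an empty list means an exact match."""
    deviations: list[str] = []
    if len(actual) != len(golden):
        deviations.append(f"step count: expected {len(golden)}, got {len(actual)}")
    # Compare the overlapping prefix step by step.
    for i, (g, a) in enumerate(zip(golden, actual)):
        if a.tool != g.tool:
            deviations.append(f"step {i}: expected tool {g.tool!r}, got {a.tool!r}")
        elif a.args != g.args:
            deviations.append(f"step {i}: arguments differ for {g.tool!r}")
        if a.state_after != g.state_after:
            deviations.append(f"step {i}: state after call differs")
    return deviations


# Golden path: update a single row in place.
golden = [
    Step("update_row", {"table": "users", "id": 7, "email": "a@example.com"},
         {"users_rows": 1}),
]

# Degraded path: same final state, reached by dropping and rebuilding the table.
degraded = [
    Step("drop_table", {"table": "users"}, {"users_rows": 0}),
    Step("create_table", {"table": "users"}, {"users_rows": 0}),
    Step("insert_row", {"table": "users", "id": 7, "email": "a@example.com"},
         {"users_rows": 1}),
]

for line in trajectory_deviations(golden, degraded):
    print(line)
```

In this example the degraded run ends in the same state as the golden run, so a judge that only sees the final output has nothing to object to, while the trajectory check reports the extra steps and the wrong first tool. The deviation list can be handed to the judge as evidence or used as a hard gate before the judge runs.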
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:20:09.699187+00:00— report_created — created