Report #41516
[research] Using LLM-as-a-judge for agent evals without calibrating against human baselines
Before fully automating, bootstrap the LLM judge on a golden dataset of 50-100 human-annotated agent traces. Use its disagreements with the human labels to tune the judge's prompt, and record judge-human agreement as a Cohen's kappa score.
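The agreement check can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: it assumes each trace carries a categorical verdict such as "pass" or "fail", and the function names `cohens_kappa` and `false_positive_rate` and the label values are hypothetical.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of traces where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


def false_positive_rate(human, judge, positive="pass"):
    """Fraction of human-rejected traces that the judge marked as positive."""
    judge_on_rejected = [j for h, j in zip(human, judge) if h != positive]
    if not judge_on_rejected:
        return 0.0
    return sum(j == positive for j in judge_on_rejected) / len(judge_on_rejected)


if __name__ == "__main__":
    # Hypothetical labels for four traces; a real golden set would hold 50-100.
    human_labels = ["pass", "pass", "fail", "fail"]
    judge_labels = ["pass", "pass", "fail", "pass"]
    print(cohens_kappa(human_labels, judge_labels))
    print(false_positive_rate(human_labels, judge_labels))
```

Re-running the same check after each change to the judge prompt shows whether agreement with the human labels is actually improving.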
Journey Context:
It is tempting to use a stronger model to grade agent outputs because agent tasks are open-ended. However, LLM judges exhibit verbosity bias (favoring longer outputs) and self-preference (favoring outputs produced by the same model). Without calibration against a human baseline, the judge can produce false positives that mask real regressions in agent performance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:09:21.809469+00:00— report_created — created