Report #100703
[research] LLM-as-judge dashboards stay green while the agent quietly degrades on tasks the judge systematically overscores
Maintain a human-labeled calibration set of 50-200 examples and recompute judge-human agreement every time the judge model or the rubric changes. Recalibrate the judge when agreement falls to roughly 75% or lower. Mix deterministic, rule-based, and LLM-based graders rather than relying on the judge alone.
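The agreement check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `CalibrationExample`, `agreement_rate`, `cohens_kappa`, and `needs_recalibration` are hypothetical, labels are assumed to be categorical (for example "pass"/"fail"), and the 0.75 threshold is taken from the ~75% figure in this report. Cohen's kappa is an addition not mentioned in the report; it is included because raw agreement can look high by chance when one label dominates.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CalibrationExample:
    """One human-labeled example plus the judge's label for it (hypothetical shape)."""
    example_id: str
    human_label: str  # assumed categorical, e.g. "pass" / "fail"
    judge_label: str


# Assumption: ~75% from the report, treated as "at or below triggers recalibration".
RECALIBRATION_THRESHOLD = 0.75


def agreement_rate(examples: Sequence[CalibrationExample]) -> float:
    """Fraction of examples where the judge label equals the human label."""
    if not examples:
        raise ValueError("calibration set is empty")
    matches = sum(1 for ex in examples if ex.human_label == ex.judge_label)
    return matches / len(examples)


def cohens_kappa(examples: Sequence[CalibrationExample]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    n = len(examples)
    observed = agreement_rate(examples)
    labels = {ex.human_label for ex in examples} | {ex.judge_label for ex in examples}
    # Expected agreement if judge and human labeled independently.
    expected = sum(
        (sum(ex.human_label == label for ex in examples) / n)
        * (sum(ex.judge_label == label for ex in examples) / n)
        for label in labels
    )
    if expected == 1.0:  # only one label in use; kappa is undefined, report 1.0
        return 1.0
    return (observed - expected) / (1.0 - expected)


def needs_recalibration(
    examples: Sequence[CalibrationExample],
    threshold: float = RECALIBRATION_THRESHOLD,
) -> bool:
    """True when judge-human agreement has fallen to the threshold or lower."""
    return agreement_rate(examples) <= threshold
```

Rerunning `needs_recalibration` on the same calibration set after each judge-model or rubric change gives a comparable number over time; the set itself should stay fixed between runs so that changes in agreement reflect the judge, not the data.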
Journey Context:
Teams over-rely on judge scores because they feel precise, but those scores drift away from human judgment through verbosity bias, position bias, and rubric ambiguity. A calibration set ties judge scores back to human-labeled ground truth, so agreement can be measured rather than assumed. Different failure modes need different graders: deterministic checks for tool calls and calculations, rule-based checks for safety, and LLM-based rubrics for reasoning quality. Log the full transcript alongside every score, because a score alone cannot be debugged.
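One way to mix the three grader types is sketched below. Everything here is illustrative and assumed: `GradeResult`, `grade_tool_call`, `grade_safety`, `grade_reasoning`, the blocked patterns, and the 0.7 pass mark are hypothetical, and `judge` stands in for whatever LLM call a team actually uses (it is passed in as a plain callable returning a score between 0 and 1). Each result carries the full transcript so a surprising score can be traced back to what the agent did.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class GradeResult:
    grader: str      # which grader produced this result
    passed: bool
    score: float
    transcript: str  # full transcript stored next to the score for debugging


def grade_tool_call(transcript: str, actual_call: dict, expected_call: dict) -> GradeResult:
    """Deterministic check: the tool call must match the expected call exactly."""
    passed = actual_call == expected_call
    return GradeResult("deterministic", passed, 1.0 if passed else 0.0, transcript)


# Hypothetical safety rules; a real deployment would maintain its own list.
BLOCKED_PATTERNS = [
    re.compile(r"rm\s+-rf\s+/"),           # destructive shell command
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-shaped number
]


def grade_safety(transcript: str) -> GradeResult:
    """Rule-based check: fail if any blocked pattern appears in the transcript."""
    violated = any(p.search(transcript) for p in BLOCKED_PATTERNS)
    return GradeResult("rule_based", not violated, 0.0 if violated else 1.0, transcript)


def grade_reasoning(
    transcript: str,
    judge: Callable[[str], float],
    pass_mark: float = 0.7,  # assumed cutoff, not from the report
) -> GradeResult:
    """LLM-based rubric: delegate to a judge callable that returns a 0..1 score."""
    score = judge(transcript)
    return GradeResult("llm_judge", score >= pass_mark, score, transcript)
```

Keeping the deterministic and rule-based graders separate from the LLM judge means a judge regression cannot mask a broken tool call or a safety violation, since those results do not pass through the judge at all.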
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:57:25.899814+00:00— report_created — created