Report #97933
[research] LLM-as-a-judge is used uncalibrated or on the hot path
Use code-based runtime guardrails inline for safety and format checks; use LLM judges only asynchronously, for quality scoring. Calibrate every judge against 20-100 human labels, use a stronger judge from a different model than the one being evaluated, require the judge to give chain-of-thought reasoning with its score, and recalibrate regularly.
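A minimal sketch of that split, assuming a response that should be a JSON object with a string "answer" field. The names guardrail, handle, judge_queue, and the BLOCKED pattern are hypothetical and only illustrate the shape: deterministic checks run inline, and the response is merely queued for a judge that some separate worker would call later.

```python
import json
import queue
import re

# Hypothetical deny rule for illustration: strings shaped like a US SSN.
BLOCKED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Work for the LLM judge is queued here and consumed off the hot path.
judge_queue = queue.Queue()


def guardrail(response_text):
    """Fast, deterministic inline checks. Returns (ok, reason)."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False, "invalid_json"
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return False, "missing_answer"
    if BLOCKED.search(payload["answer"]):
        return False, "blocked_pattern"
    return True, "ok"


def handle(request_id, response_text):
    """Decide inline with code only; never wait on the judge here."""
    ok, reason = guardrail(response_text)
    if ok:
        # Quality scoring happens later; the user is not blocked on it.
        judge_queue.put({"id": request_id, "response": response_text})
    return ok, reason
```

A background worker (not shown) would drain judge_queue, call the judge model, and store the scores for offline review, so judge latency and non-determinism never affect the user-facing decision.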
Journey Context:
Teams often treat LLM judges as both gatekeepers and graders. Guardrails must be fast and deterministic; judges are slow, non-deterministic, and biased. Running a judge inline adds its latency to every request, so users wait on it. Calibrating against human labels exposes length, position, and self-preference biases so they can be corrected, and that is what makes the score trustworthy.
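One way to check a judge against human labels is to measure how often the two agree and how much of that agreement exceeds chance. The sketch below computes raw agreement and Cohen's kappa over paired labels; the function names are hypothetical, and the acceptance threshold is left to the team because the report does not specify one.

```python
from collections import Counter


def agreement(human, judge):
    """Fraction of items where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for chance, from the two label distributions."""
    n = len(human)
    p_observed = agreement(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label at random.
    p_expected = sum(human_counts[c] * judge_counts[c] for c in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Run both functions over the 20-100 human-labeled items, and rerun them at each recalibration. A drop in kappa suggests the judge has drifted or the data has shifted. Comparing scores on the same items with answer order swapped, or grouped by response length, is a similar way to probe position and length bias.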
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:57:10.995143+00:00— report_created — created