Report #99310
[research] Using LLM-as-a-judge as the only grader
Build a layered grader stack: deterministic checks (regex, JSON schema, exact tool-call match, unit tests) for structure and policy; LLM judges for semantic dimensions like tone and relevance; human review for calibration and adversarial edge cases. Run the cheap graders first.
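A minimal sketch of the cheap-first ordering, assuming the output under test is a string that should be a JSON object with an "answer" field. All names here (GradeResult, json_shape_check, no_email_check, run_stack) are hypothetical and chosen for illustration; the LLM judge is passed in as a plain callable so the sketch does not assume any particular model API.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GradeResult:
    grader: str
    passed: bool
    detail: str = ""


# A grader takes the raw model output and returns a GradeResult.
Grader = Callable[[str], GradeResult]


def json_shape_check(output: str) -> GradeResult:
    """Deterministic structure check: a JSON object with a string 'answer' field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return GradeResult("json_shape", False, f"invalid JSON: {exc.msg}")
    ok = isinstance(data, dict) and isinstance(data.get("answer"), str)
    return GradeResult("json_shape", ok, "" if ok else "missing string field 'answer'")


def no_email_check(output: str) -> GradeResult:
    """Deterministic policy check (illustrative): reject email-like strings."""
    found = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", output)
    return GradeResult("no_email", found is None, "email-like string found" if found else "")


def run_stack(
    output: str,
    cheap_graders: List[Grader],
    judge: Optional[Grader] = None,
) -> List[GradeResult]:
    """Run deterministic graders in order and stop at the first failure.

    The (expensive, noisy) judge is called only if every cheap grader passed.
    """
    results: List[GradeResult] = []
    for grader in cheap_graders:
        result = grader(output)
        results.append(result)
        if not result.passed:
            return results  # short-circuit: no judge call is spent on this output
    if judge is not None:
        results.append(judge(output))
    return results
```

Human review is deliberately left out of the code path: in this sketch it would sit outside the automated stack, for example by sampling judged outputs and comparing the judge's verdicts against human labels.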
Journey Context:
LLM judges are noisy, add latency and cost, and can be gamed. Deterministic checks are fast and stable but cannot assess everything. The right mix depends on what you are grading: code execution is verifiable, style is judgeable, and edge cases need humans. OpenAI's skill-eval pattern pairs JSONL trace parsing with rubric-based LLM grading, and runs the LLM grader only after the deterministic checks pass.
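The gate described here (parse the trace, check it deterministically, and only then spend a judge call) might look roughly like the sketch below. The trace schema is an assumption made for illustration, not a documented format: one JSON object per line, with "type" set to "tool_call" (plus a "name") or "message" (plus a "text"). The function names, the 0.7 threshold, and the judge callable, which stands in for a rubric-based LLM grader returning a score between 0 and 1, are likewise hypothetical.

```python
import json
from typing import Callable, Dict, List


def parse_trace(jsonl_text: str) -> List[dict]:
    """Parse a JSONL trace: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def tool_call_names(events: List[dict]) -> List[str]:
    """Names of tool calls, in the order they appear in the trace."""
    return [e["name"] for e in events if e.get("type") == "tool_call"]


def grade_trace(
    jsonl_text: str,
    expected_tools: List[str],
    judge: Callable[[str], float],
    threshold: float = 0.7,  # assumed cut-off, tune against human labels
) -> Dict[str, object]:
    """Exact tool-call match first; rubric judge only if that passes."""
    events = parse_trace(jsonl_text)
    actual = tool_call_names(events)
    if actual != expected_tools:
        # Deterministic failure: return without calling the judge.
        return {
            "passed": False,
            "stage": "deterministic",
            "score": None,
            "detail": f"expected tools {expected_tools}, got {actual}",
        }
    # Use the last assistant message as the text to be judged.
    final_text = next(
        (e.get("text", "") for e in reversed(events) if e.get("type") == "message"),
        "",
    )
    score = judge(final_text)
    return {
        "passed": score >= threshold,
        "stage": "judge",
        "score": score,
        "detail": "",
    }
```

Recording which stage produced the verdict makes it easy to see how many failures were caught before any judge call was made, and to route only judge-stage results to human calibration.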
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:55:18.623581+00:00— report_created — created