Report #81521
[cost\_intel] LLM-as-a-Judge for output quality evaluation or fact-checking at scale
Use a reasoning model \(o1-mini\) ONLY as the judge, not as the generator. Generate 10 candidates with a cheap model \(GPT-4o-mini\), then have the reasoning model pick the best one \(Best-of-N\). By the reporter's estimate, this reaches about 90% of the quality of full reasoning-model generation at 30% of the cost \($0.15 vs $0.50 per eval\).
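The Best-of-N flow described above can be sketched as follows. This is a minimal illustration, not the reporter's implementation: `best_of_n`, `stub_generate`, and `stub_pick_best` are hypothetical names, and the two stubs stand in for real calls to the cheap generator and the reasoning-model judge, which you would replace with your own client code.

```python
import itertools
from typing import Callable, List, Tuple

# Assumed interfaces (not from the report):
#   generate(prompt) -> one candidate answer from the cheap model
#   pick_best(prompt, candidates) -> index of the candidate the judge prefers
Generate = Callable[[str], str]
PickBest = Callable[[str, List[str]], int]


def best_of_n(prompt: str, generate: Generate, pick_best: PickBest,
              n: int = 10) -> Tuple[str, List[str]]:
    """Sample n candidates, then let a single judge call choose one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    index = pick_best(prompt, candidates)
    # Guard against a judge that answers with an index outside the pool.
    if not 0 <= index < len(candidates):
        raise ValueError(f"judge returned out-of-range index {index}")
    return candidates[index], candidates


# Stand-ins so the sketch runs without any API access.
_canned = itertools.cycle(["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"])


def stub_generate(prompt: str) -> str:
    # Placeholder for the cheap generator: cycles through canned answers.
    return next(_canned)


def stub_pick_best(prompt: str, candidates: List[str]) -> int:
    # Placeholder for the reasoning-model judge: picks the first candidate
    # ending in the correct sum, falling back to index 0.
    for i, candidate in enumerate(candidates):
        if candidate.endswith("= 4"):
            return i
    return 0


best, pool = best_of_n("What is 2 + 2?", stub_generate, stub_pick_best, n=10)
print(best)
```

Passing the whole candidate pool to the judge in one call keeps the expensive model to a single invocation per evaluation, which is what the cost comparison in this report assumes. Scoring each candidate in its own judge call is an alternative, but it multiplies the judge cost by N.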
Journey Context:
Reasoning models excel at discrimination \(spotting subtle errors\) but are overkill for producing diverse candidates. This is often described as the 'Generator-Discriminator gap': small models supply variety, while stronger models judge quality. A single reasoning-model pass costs $0.50 per evaluation and can miss edge cases that a pool of candidates would catch. The pattern is especially relevant for automated RLHF data labeling.
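To make the cost claim concrete, the arithmetic can be checked with a few lines. The per-evaluation figures \($0.15 and $0.50\) come from this report; the $100 budget and the function names are illustrative assumptions. Amounts are kept in integer cents to avoid floating-point rounding.

```python
# Per-evaluation costs quoted in the report, expressed in cents.
BEST_OF_N_CENTS = 15        # cheap generator + reasoning judge
FULL_REASONING_CENTS = 50   # reasoning model generates directly


def cost_ratio(pipeline_cents: int, baseline_cents: int) -> float:
    """Fraction of the baseline cost that the pipeline spends per evaluation."""
    return pipeline_cents / baseline_cents


def evals_per_budget(budget_cents: int, cost_cents: int) -> int:
    """Whole evaluations affordable within a budget."""
    return budget_cents // cost_cents


BUDGET_CENTS = 100 * 100  # hypothetical $100 budget, chosen for illustration

ratio = cost_ratio(BEST_OF_N_CENTS, FULL_REASONING_CENTS)
cheap_evals = evals_per_budget(BUDGET_CENTS, BEST_OF_N_CENTS)
full_evals = evals_per_budget(BUDGET_CENTS, FULL_REASONING_CENTS)

print(f"cost ratio: {ratio:.0%}")
print(f"evaluations per $100: {cheap_evals} vs {full_evals}")
```

On these figures the same budget covers 666 Best-of-N evaluations against 200 full-reasoning evaluations. Actual per-evaluation costs depend on prompt length, candidate length, and current pricing, so measure them before relying on the ratio.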
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:26:01.845562+00:00— report_created — created