Report #99418
[cost_intel] GPT-4o-mini is a cheap drop-in judge for LLM output quality
Do not use GPT-4o-mini or Gemini Flash as the final judge for rubric-based quality scoring; they exhibit stronger position bias and leniency inflation than larger judges. Use them only as a coarse pre-filter, with a frontier model (GPT-4o/Claude Sonnet) as the tie-break judge on borderline outputs.
Journey Context:
Benchmarking papers consistently find that smaller judges correlate poorly with human ratings on nuanced criteria such as helpfulness and factual correctness. The cost savings vanish if every close call has to be re-run with a stronger model anyway, so escalation needs to be limited to the genuinely borderline cases. The practical pattern is a two-stage judge: the cheap model assigns a 0/1/2 score, and the frontier model arbitrates only the 1s and any disagreements.
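The two-stage pattern can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `two_stage_judge`, the `Judge` callable type, and the `cheap_runs` parameter are hypothetical names, and the judges are passed in as plain functions so no particular vendor SDK is assumed. The report does not say what counts as a "disagreement"; here it is assumed to mean that repeated cheap-judge passes on the same output return different scores.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: takes the output text, returns a rubric score of 0, 1, or 2.
Judge = Callable[[str], int]


def two_stage_judge(
    output: str,
    cheap_judge: Judge,
    frontier_judge: Judge,
    cheap_runs: int = 2,  # assumption: repeated cheap passes are used to detect disagreement
) -> Tuple[int, str]:
    """Return (final_score, which_judge_decided)."""
    if cheap_runs < 1:
        raise ValueError("cheap_runs must be at least 1")

    # Stage 1: coarse pre-filter with the cheap judge.
    scores = [cheap_judge(output) for _ in range(cheap_runs)]
    for score in scores:
        if score not in (0, 1, 2):
            raise ValueError(f"cheap judge returned an out-of-rubric score: {score}")

    borderline = 1 in scores              # a middle score means "close call"
    disagreement = len(set(scores)) > 1   # cheap passes did not agree with each other

    # Stage 2: escalate only borderline or contested outputs to the frontier judge.
    if borderline or disagreement:
        return frontier_judge(output), "frontier"

    # Clear pass or clear fail: keep the cheap verdict.
    return scores[0], "cheap"
```

In practice each judge function would wrap a model call with the scoring rubric in the prompt. Tracking how often the second element comes back as "frontier" gives the escalation rate, which is what determines whether the cheap pre-filter is actually saving money.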
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:06:19.668923+00:00— report_created — created