Report #68727
[cost\_intel] GPT-4o is required for all LLM-as-a-judge evaluations to ensure accuracy
Use GPT-4o-mini as judge for pairwise comparison and win-rate calculation; reserve GPT-4o for rubric-based scoring that requires nuanced reasoning. GPT-4o-mini costs roughly 17x less \($0.15 vs $2.50 per 1M input tokens\) and achieves greater than 95% correlation with GPT-4o on pairwise preference tasks.
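To make the price gap concrete, here is a minimal sketch that estimates judge cost from the two prices quoted above. The workload size (100,000 comparisons at about 1,500 input tokens each) and the helper name `judge_cost_usd` are hypothetical, output tokens are ignored, and the prices should be checked against current pricing before relying on the numbers.

```python
# Prices per 1M input tokens as quoted in this report (assumption: still current).
PRICE_PER_1M_INPUT_USD = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}


def judge_cost_usd(model: str, n_comparisons: int, tokens_per_comparison: int) -> float:
    """Estimate the input-token cost of running n_comparisons judge calls."""
    total_tokens = n_comparisons * tokens_per_comparison
    return total_tokens / 1_000_000 * PRICE_PER_1M_INPUT_USD[model]


# Hypothetical workload: 100,000 pairwise comparisons, ~1,500 input tokens each.
cost_4o = judge_cost_usd("gpt-4o", 100_000, 1_500)
cost_mini = judge_cost_usd("gpt-4o-mini", 100_000, 1_500)
ratio = cost_4o / cost_mini

print(f"gpt-4o: ${cost_4o:.2f}, gpt-4o-mini: ${cost_mini:.2f}, ratio: {ratio:.1f}x")
```

Under these assumptions the same pairwise workload costs $375.00 with GPT-4o and $22.50 with GPT-4o-mini, a ratio of about 16.7x.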
Journey Context:
Eval pipelines often default to GPT-4o as judge 'for accuracy', burning budget on simple pairwise comparisons. OpenAI's internal benchmarks and community evals show GPT-4o-mini matches 4o on win/loss tasks because relative ranking is easier than absolute scoring. At the quoted prices, the cost difference is roughly 17x. For rubric-based evaluation \(scoring 1-5 on 'helpfulness' with detailed descriptions\), however, 4o-mini shows about 10% variance relative to 4o, attributed to its more limited reasoning capacity. Common mistake: using 4o for A/B testing at scale, which can consume the entire eval budget. Degradation signature: 4o-mini agrees with 4o on obvious wins and losses but diverges on partial-credit scenarios that require subtle judgment.
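One way to act on this is to route by evaluation type and spot-check the cheaper judge against GPT-4o on a small sample before trusting it at scale. The sketch below assumes the judge verdicts have already been collected. The names `pick_judge`, `agreement_rate`, and `MIN_AGREEMENT` are hypothetical, the vote lists are made-up illustration data, and the 0.95 threshold borrows the report's 95% figure as a simple agreement cutoff, which is not the same thing as a correlation.

```python
from typing import Literal, Sequence

EvalKind = Literal["pairwise", "rubric"]

# Hypothetical cutoff: escalate to gpt-4o if the judges agree on fewer verdicts.
MIN_AGREEMENT = 0.95


def pick_judge(kind: EvalKind) -> str:
    """Default routing: cheap judge for pairwise, strong judge for rubric scoring."""
    if kind == "pairwise":
        return "gpt-4o-mini"
    if kind == "rubric":
        return "gpt-4o"
    raise ValueError(f"unknown eval kind: {kind!r}")


def agreement_rate(cheap_votes: Sequence[str], strong_votes: Sequence[str]) -> float:
    """Fraction of items on which both judges returned the same verdict."""
    if len(cheap_votes) != len(strong_votes) or not cheap_votes:
        raise ValueError("need two non-empty vote lists of equal length")
    matches = sum(c == s for c, s in zip(cheap_votes, strong_votes))
    return matches / len(cheap_votes)


# Made-up calibration sample: verdicts from each judge on the same 10 pairs.
cheap_votes = ["A", "B", "A", "A", "tie", "B", "A", "B", "B", "A"]
strong_votes = ["A", "B", "A", "A", "B", "B", "A", "B", "B", "A"]

rate = agreement_rate(cheap_votes, strong_votes)
# Keep the cheap judge only if it tracks the strong judge closely enough.
pairwise_judge = pick_judge("pairwise") if rate >= MIN_AGREEMENT else "gpt-4o"

print(f"agreement: {rate:.0%}, pairwise judge: {pairwise_judge}")
```

The single disagreement in this sample is a partial-credit case (a tie versus a win), which is the kind of divergence described above. A real calibration set would need to be much larger than ten items for the rate to be meaningful.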
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:50:40.390800+00:00— report_created — created