Report #62470
[cost_intel] Using GPT-4o to grade complex essay responses or expert reasoning tasks results in a 40% false positive rate on subtle logical errors
Use o3-mini-high or o1 to evaluate reasoning quality, mathematical proofs, or code correctness against rubrics in high-stakes domains; use GPT-4o only for surface-level grammar and style checks, or when the evaluation rubric is purely factual (keyword matching).
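The routing rule above could be sketched as a small lookup. This is only an illustration under assumptions: the `TaskKind` categories and the `pick_judge` helper are hypothetical names, and the model strings are labels taken from this report, not verified API identifiers. Sending every non-surface task to the reasoning judge is also an assumption; the report only speaks to high-stakes domains.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical evaluation categories, named after the cases in this report."""
    GRAMMAR_STYLE = "grammar_style"
    FACTUAL_KEYWORD = "factual_keyword"
    REASONING_QUALITY = "reasoning_quality"
    MATH_PROOF = "math_proof"
    CODE_CORRECTNESS = "code_correctness"


# Labels as written in the report; map them to real model identifiers yourself.
REASONING_JUDGE = "o3-mini-high"
CHEAP_JUDGE = "gpt-4o"

# Tasks the report considers safe for the cheaper judge.
SURFACE_TASKS = {TaskKind.GRAMMAR_STYLE, TaskKind.FACTUAL_KEYWORD}


def pick_judge(task: TaskKind) -> str:
    """Return the judge label for a task; defaults to the reasoning judge."""
    if task in SURFACE_TASKS:
        return CHEAP_JUDGE
    return REASONING_JUDGE


if __name__ == "__main__":
    for kind in TaskKind:
        print(f"{kind.value}: {pick_judge(kind)}")
```

Defaulting to the reasoning judge means an unclassified or new task kind fails toward the more careful (and more expensive) evaluator rather than toward false passes.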
Journey Context:
The GPQA (Graduate-Level Google-Proof Q&A) benchmark shows o3-mini-high scoring 85% versus GPT-4o's 55% on PhD-level questions. When used as evaluators ('LLM-as-judge'), cheaper models exhibit 'judgment degradation': they fail to catch subtle logical fallacies or incorrect proof steps and give false passes. The higher cost of a reasoning model is justified when false positives are expensive (e.g., bad hires, incorrect medical summaries, safety violations).
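One way to make "justified when false positives are expensive" concrete is a break-even calculation. The sketch below is an assumption-laden illustration, not a measured result: both function names are hypothetical, the 40% figure is read as the share of subtly flawed responses the cheap judge passes, and every fee, prevalence, and reasoning-judge miss rate in the example is a placeholder to replace with your own numbers.

```python
def expected_cost_per_item(judge_fee: float, miss_rate: float,
                           flaw_prevalence: float, miss_cost: float) -> float:
    """Judge fee plus the expected loss from flawed responses wrongly passed."""
    return judge_fee + flaw_prevalence * miss_rate * miss_cost


def break_even_miss_cost(cheap_fee: float, strong_fee: float,
                         flaw_prevalence: float,
                         cheap_miss_rate: float,
                         strong_miss_rate: float) -> float:
    """Cost of one false pass above which the stronger judge is cheaper overall."""
    avoided_misses = flaw_prevalence * (cheap_miss_rate - strong_miss_rate)
    if avoided_misses <= 0:
        raise ValueError("stronger judge must miss fewer flawed responses")
    return (strong_fee - cheap_fee) / avoided_misses


if __name__ == "__main__":
    # Placeholder inputs: only the 0.40 miss rate comes from this report.
    threshold = break_even_miss_cost(
        cheap_fee=0.01, strong_fee=0.05,
        flaw_prevalence=0.2,
        cheap_miss_rate=0.40, strong_miss_rate=0.10,
    )
    print(f"Reasoning judge pays off when a false pass costs more than {threshold:.2f}")
```

If a single false pass costs more than the returned threshold (in the same currency as the fees), the reasoning judge has the lower expected cost per graded item under these assumptions.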
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:20:22.755656+00:00— report_created — created