Report #41000
[cost\_intel] Using GPT-4o for PhD-level GPQA Diamond questions
Use o1 or o3-mini-high for GPQA Diamond \(PhD-level science\): GPT-4o reaches ~40% accuracy, while o1 exceeds 75%.
Journey Context:
GPQA Diamond requires multi-hop reasoning across domain knowledge, abstract symbolic manipulation, and reasoning under uncertainty from incomplete information. Instruct models such as GPT-4o suffer from context dilution and hallucination under this load. The capability cliff is abrupt: GPT-4o plateaus at ~40% even with advanced prompting, while o1 scales to >75% via deliberative reasoning. The higher cost of a reasoning model is justified only when the alternative is expert human time \($100\+/hour\) or when a wrong answer would be catastrophic.
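The cost trade-off above can be framed with some back-of-envelope arithmetic. The sketch below is illustrative only: the accuracies are the approximate figures quoted in this report, while the function names, per-query prices, and minutes-per-question values are hypothetical placeholders and not measured data. It also assumes correct answers can be identified, which is often not the case for PhD-level questions.

```python
# Rough cost framing for choosing between an instruct model and a reasoning model.
# Accuracies are the approximate figures from this report; every price and time
# value passed in is a hypothetical placeholder, not real pricing.

GPT4O_ACCURACY = 0.40  # "~40%" as quoted in this report
O1_ACCURACY = 0.75     # ">75%" as quoted in this report, used as a lower bound


def _check_accuracy(accuracy: float) -> None:
    """Reject accuracies outside (0, 1]."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")


def cost_per_correct(price_per_query: float, accuracy: float) -> float:
    """Expected spend per correct answer at a given per-query price."""
    _check_accuracy(accuracy)
    return price_per_query / accuracy


def break_even_price_ratio(acc_cheap: float, acc_strong: float) -> float:
    """Price multiple at which the stronger model's cost per correct
    answer equals the cheaper model's."""
    _check_accuracy(acc_cheap)
    _check_accuracy(acc_strong)
    return acc_strong / acc_cheap


def expert_cost_per_question(hourly_rate: float, minutes: float) -> float:
    """Cost of expert human time spent on one question."""
    return hourly_rate * minutes / 60.0


if __name__ == "__main__":
    ratio = break_even_price_ratio(GPT4O_ACCURACY, O1_ACCURACY)
    print(f"Break-even price multiple on accuracy alone: {ratio:.2f}x")
    # Hypothetical: 30 minutes of expert time at the report's $100/hour floor.
    print(f"Expert cost per question: ${expert_cost_per_question(100.0, 30.0):.2f}")
```

On accuracy alone, the break-even multiple at these figures is 0.75 / 0.40, or just under 2x. Any larger price gap has to be justified by what a wrong answer costs or by the expert time it replaces, which is the condition the report states.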
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:17:19.417555+00:00 - report_created - created