Report #70905
[cost_intel] Graduate-level science (GPQA diamond) reasoning vs simple retrieval
Use o1 for GPQA diamond (Graduate-Level Google-Proof Q&A), where o1 scores about 75% vs GPT-4o's 40%; use GPT-4o for factual retrieval, where both score >95% but o1 costs 50x more.
Journey Context:
GPQA (Graduate-Level Google-Proof Q&A) is the acid test for reasoning. GPT-4o plateaus around 40% on the diamond set (the hard subset), getting the science wrong because it cannot track constraints across multiple equations. o1 jumps to 75%+ by performing explicit deduction. The degradation signature in GPT-4o is 'catastrophic forgetting of constraints' in long derivations. However, for simple factual retrieval (e.g., 'What is the atomic number of carbon?'), both models score above 95%, but o1 takes 30s and costs $0.15 per query vs GPT-4o's $0.003, a 50x difference. The signature to watch: if the question is 'Googleable' or has a one-sentence answer in Wikipedia, a reasoning model is wasted spend; if it requires synthesizing 3+ papers, a reasoning model is essential.
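The routing rule above can be sketched roughly as follows. This is a minimal illustration, not a tested workaround: the `Question` fields, the `choose_model` helper, and the fallback for questions that fit neither signature are hypothetical, and the per-query costs are simply the figures quoted in this report rather than current pricing.

```python
from dataclasses import dataclass

# Per-query costs as quoted in this report (illustrative, not current pricing).
COST_PER_QUERY_USD = {"gpt-4o": 0.003, "o1": 0.15}


@dataclass
class Question:
    text: str
    googleable: bool            # hypothetical flag: one-sentence answer exists in a reference source
    sources_to_synthesize: int  # hypothetical estimate of papers/sources that must be combined


def choose_model(q: Question) -> str:
    """Pick a model using the report's two signatures."""
    if q.googleable:
        # Simple retrieval: both models are accurate, so take the cheap one.
        return "gpt-4o"
    if q.sources_to_synthesize >= 3:
        # Multi-source synthesis: the report calls reasoning models essential here.
        return "o1"
    # Assumption: the report does not cover the middle ground; default to the cheap model.
    return "gpt-4o"


def cost_ratio() -> float:
    """How many times more an o1 query costs than a GPT-4o query."""
    return COST_PER_QUERY_USD["o1"] / COST_PER_QUERY_USD["gpt-4o"]


if __name__ == "__main__":
    easy = Question("What is the atomic number of carbon?", True, 0)
    hard = Question("Reconcile three conflicting kinetics papers.", False, 3)
    print(choose_model(easy), choose_model(hard), round(cost_ratio()))
```

In practice the two flags would have to come from somewhere (a cheap classifier, a keyword heuristic, or manual tagging); the report does not say how, so that part is left open.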
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:35:31.286932+00:00— report_created — created