Report #96748
[cost\_intel] Uniform model choice for all science Q&A tasks
On GPQA \(graduate-level science\), use o3 only when the task requires more than ~55% accuracy \(o3: ~60% at $0.30 per question; 4o: ~35% at $0.01 per question\). On MMLU \(undergraduate level\), use 4o \(85%\+ accuracy at roughly 1/20 the cost of o3\).
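The arithmetic behind this recommendation can be sketched as follows. This is a minimal illustration that uses only the approximate GPQA figures quoted in this report; the dictionary layout and the `cost_per_correct` helper are hypothetical names chosen for the example, not part of any tool.

```python
# Approximate GPQA figures as quoted in this report (not fresh measurements).
GPQA = {
    "o3": {"accuracy": 0.60, "cost_per_question": 0.30},
    "4o": {"accuracy": 0.35, "cost_per_question": 0.01},
}


def cost_per_correct(accuracy: float, cost_per_question: float) -> float:
    """Expected spend in dollars to obtain one correct answer."""
    if accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return cost_per_question / accuracy


# Hypothetical helper output: dollars per correct answer for each model.
gpqa_cost_per_correct = {
    name: cost_per_correct(**figures) for name, figures in GPQA.items()
}

for name, dollars in gpqa_cost_per_correct.items():
    print(f"{name}: ${dollars:.3f} per correct answer")
```

With these figures, o3 works out to about $0.50 per correct answer and 4o to about $0.03. On GPQA, then, o3 is justified by the accuracy requirement rather than by a lower cost per correct answer.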
Journey Context:
The cost-versus-accuracy trade-off diverges at PhD level \(GPQA\): o3 costs 30x more per question but raises accuracy from ~35% to ~60%. On MMLU \(undergrad\), 4o scores >85% and o3 ~90%, so paying 20x the cost for about 5 points is wasteful. The discriminator is benchmark difficulty: use GPQA-like internal evals to gate model selection.
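One way to express such a gate is to pick the cheapest model whose measured accuracy on the internal eval meets the requirement. The sketch below assumes that rule; `pick_model`, `hard_eval`, and `easy_eval` are hypothetical names. The MMLU-like costs are in relative units because the report gives only the 20x ratio.

```python
from typing import Dict, Optional, Tuple

# model name -> (measured accuracy, cost per question)
EvalResults = Dict[str, Tuple[float, float]]


def pick_model(required_accuracy: float, results: EvalResults) -> Optional[str]:
    """Return the cheapest model meeting the accuracy floor, or None if none does."""
    eligible = [
        (cost, name)
        for name, (accuracy, cost) in results.items()
        if accuracy >= required_accuracy
    ]
    return min(eligible)[1] if eligible else None


# GPQA-like eval, using the approximate figures from this report (dollars).
hard_eval: EvalResults = {"o3": (0.60, 0.30), "4o": (0.35, 0.01)}
# MMLU-like eval; costs are relative units (assumption: only the 20x ratio is known).
easy_eval: EvalResults = {"o3": (0.90, 20.0), "4o": (0.85, 1.0)}

hard_choice = pick_model(0.55, hard_eval)  # accuracy floor from this report
easy_choice = pick_model(0.80, easy_eval)  # illustrative floor, not from the report
print(hard_choice, easy_choice)
```

Returning `None` when no model clears the floor keeps the gate from silently falling back to a model that cannot meet the requirement.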
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:58:40.099959+00:00— report_created — created