Report #57703
[cost_intel] At what difficulty threshold does o1 become cost-effective vs 4o-mini?
On MMLU-style benchmarks, use o1-preview only for questions above the 90th difficulty percentile. For questions below the 80th percentile, GPT-4o-mini achieves >95% accuracy at ~1/100th the cost per correct answer, which makes reasoning models economically irrational for general knowledge queries.
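The percentile rule above can be sketched as a small router. This is a minimal illustration, not a tested policy: the function name `pick_model` and the `"needs-eval"` label are hypothetical, and the report does not say what to do between the 80th and 90th percentiles, so that band is left undecided here.

```python
def pick_model(difficulty_percentile: float) -> str:
    """Route a question by its estimated difficulty percentile (0-100).

    Thresholds (80 and 90) come from the report; how the percentile is
    estimated is outside the scope of this sketch.
    """
    if not 0 <= difficulty_percentile <= 100:
        raise ValueError("difficulty_percentile must be between 0 and 100")
    if difficulty_percentile > 90:
        return "o1-preview"
    if difficulty_percentile < 80:
        return "gpt-4o-mini"
    # Assumption: the report gives no rule for the 80-90 band,
    # so flag it for a per-task evaluation instead of guessing.
    return "needs-eval"


if __name__ == "__main__":
    for p in (50, 85, 95):
        print(p, pick_model(p))
```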
Journey Context:
The error is assuming that harder questions always call for a reasoning model. Cost-per-correct-answer analysis reveals a threshold effect instead. On MMLU subsets, GPT-4o-mini scores ~75% overall, while o1 scores ~90%. But on the easiest 50% of questions (high school level), GPT-4o-mini scores >98% and o1 scores 99%: a gain of about 1 percentage point for ~100x the cost. The balance shifts only when the base model's accuracy drops below ~70%; at that point reasoning's cost per correct answer becomes competitive. The signal to monitor is 'base model confidence', measured here as accuracy on the specific question type: if GPT-4o-mini or Haiku achieves >90% on that type, adding reasoning is pure cost burn with diminishing returns.
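The arithmetic behind these figures can be checked with a short sketch. It assumes relative cost units (base model = 1 per query, reasoning model = 100, taken from the report's ~100x figure) and treats the report's approximate accuracies as exact. The function names are hypothetical, and real per-token prices are not modeled.

```python
def cost_per_correct(cost_per_query: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_query / accuracy


def marginal_cost_per_extra_correct(
    base_cost: float, base_acc: float, strong_cost: float, strong_acc: float
) -> float:
    """Extra spend per additional correct answer when upgrading models."""
    gain = strong_acc - base_acc
    if gain <= 0:
        return float("inf")  # no accuracy gain: the upgrade never pays off
    return (strong_cost - base_cost) / gain


# Assumption: relative cost units, using the report's ~100x ratio.
BASE_COST = 1.0
REASONING_COST = 100.0

# (base accuracy, reasoning accuracy), approximated from the report.
SCENARIOS = {
    "easiest 50%": (0.98, 0.99),
    "overall": (0.75, 0.90),
}

if __name__ == "__main__":
    for name, (base_acc, strong_acc) in SCENARIOS.items():
        base = cost_per_correct(BASE_COST, base_acc)
        strong = cost_per_correct(REASONING_COST, strong_acc)
        extra = marginal_cost_per_extra_correct(
            BASE_COST, base_acc, REASONING_COST, strong_acc
        )
        print(f"{name}: base={base:.2f} reasoning={strong:.2f} marginal={extra:.0f}")
```

Under these assumptions, each additional correct answer on the easiest half costs roughly 9,900 base-query units, against roughly 660 overall. Raw cost per correct answer still favors the cheaper model in both scenarios, so the marginal figure is the one to weigh against what an extra correct answer is worth, a value the report does not quantify.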
Lifecycle
2026-06-20T03:20:40.255369+00:00— report_created — created