Report #60689
[cost_intel] How to determine whether a reasoning model is cost-effective for a specific task?
Calculate cost-per-correct-answer by sampling 100 examples on both models; only use the reasoning model if the accuracy gain in percentage points exceeds (cost_ratio * acceptable_error_rate). Typically, reasoning wins only when base model accuracy is below 70% on exact-match or binary correctness metrics.
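The sampling step and decision rule above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the `EvalResult` record and all function names are hypothetical, both models are assumed to have been run on the same sampled examples, and `acceptable_error_pct` is assumed to be expressed in percentage points (5.0 means 5%), since the report does not state the unit.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Outcome of one sampled example (hypothetical record shape)."""
    correct: bool     # exact-match or binary correctness
    cost_usd: float   # total cost of answering this example


def accuracy(results):
    """Fraction of sampled examples answered correctly."""
    return sum(r.correct for r in results) / len(results)


def cost_per_correct(results):
    """Total spend divided by the number of correct answers."""
    n_correct = sum(r.correct for r in results)
    if n_correct == 0:
        return float("inf")  # no correct answers at any price
    return sum(r.cost_usd for r in results) / n_correct


def reasoning_is_worth_it(base, reasoning, acceptable_error_pct):
    """Apply the report's rule: gain (pp) > cost_ratio * acceptable error.

    Assumption: acceptable_error_pct is in percentage points (5.0 == 5%).
    """
    base_cost = sum(r.cost_usd for r in base)
    if base_cost <= 0:
        raise ValueError("base model cost must be positive")
    gain_pp = 100.0 * (accuracy(reasoning) - accuracy(base))
    cost_ratio = sum(r.cost_usd for r in reasoning) / base_cost
    return gain_pp > cost_ratio * acceptable_error_pct
```

Under these assumptions, a base model at 90% and a reasoning model at 95% with a 10x cost ratio and a 5% acceptable error rate would not pass the rule, because a 5-point gain is far below the 50-point threshold.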
Journey Context:
Teams often compare $/token, but the relevant metric is $/correct-answer. If gpt-4o reaches 90% accuracy at $0.01 per request and o1 reaches 95% at $0.10 per request, cost-per-correct is about $0.011 vs. $0.105: o1 is roughly 9.5x more expensive per unit of correctness, not 10x. The inflection point typically sits around 70% base accuracy: below it, reasoning provides steep gains; above it, diminishing returns dominate. Measuring this way guards against 'accuracy panic', where teams overpay for marginal gains on tasks the base model already handles well.
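The arithmetic in this example can be reproduced with a few lines. This is an illustrative sketch: the function name `expected_cost_per_correct` is hypothetical, and the prices are assumed to be per request, which the report does not state explicitly.

```python
def expected_cost_per_correct(cost_per_request, accuracy):
    """Expected spend per correct answer: price divided by accuracy."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_request / accuracy


# Figures taken from the example above.
gpt4o = expected_cost_per_correct(0.01, 0.90)
o1 = expected_cost_per_correct(0.10, 0.95)

print(f"gpt-4o: ${gpt4o:.4f} per correct answer")  # $0.0111
print(f"o1:     ${o1:.4f} per correct answer")     # $0.1053
print(f"ratio:  {o1 / gpt4o:.2f}x")                # 9.47x
```

The ratio comes out slightly below the raw 10x price ratio because the higher accuracy of the more expensive model offsets part of its cost.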
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:21:24.835091+00:00— report_created — created