Report #75683
[cost_intel] Assuming linear cost-quality scaling without evaluating cost-per-correct-answer
Calculate cost-per-correct-answer (CPCA), not just accuracy: on GPQA (graduate-level science), o1 costs ~$40 per correct answer vs. GPT-4o's $2 (a 20x premium); on GSM8K, o1 costs $0.50 per correct answer vs. $0.02 (a 25x premium for a 3% accuracy gain).
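A minimal sketch of the CPCA calculation, assuming you have the total spend for an evaluation run and the number of answers graded correct. The function names are hypothetical and not part of any library; the figures in the usage lines are the per-correct-answer costs quoted in this report.

```python
def cost_per_correct_answer(total_cost_usd: float, num_correct: int) -> float:
    """Total evaluation spend divided by the number of correct answers."""
    if num_correct <= 0:
        # No correct answers: no amount of spend buys a correct one.
        return float("inf")
    return total_cost_usd / num_correct


def cost_premium(cpca_expensive: float, cpca_cheap: float) -> float:
    """How many times more one model costs per correct answer than another."""
    if cpca_cheap <= 0:
        raise ValueError("cpca_cheap must be positive")
    return cpca_expensive / cpca_cheap


if __name__ == "__main__":
    # Per-correct-answer costs as quoted in this report (USD).
    print(f"GPQA premium:  {cost_premium(40.0, 2.0):.0f}x")
    print(f"GSM8K premium: {cost_premium(0.50, 0.02):.0f}x")
```

Dividing by correct answers rather than by queries is the point of the metric: a cheaper model that is wrong most of the time can still end up with a higher cost per usable result.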
Journey Context:
The pricing cliff: o1 input tokens cost ~$15/1M and output tokens ~$60/1M, vs. GPT-4o at $5/1M input and $15/1M output. Reasoning models also emit 3-10x more output tokens (chain-of-thought), so the per-call gap is much wider than the 3-4x per-token price ratio suggests: a single o1 call costing $30 might replace a $0.50 GPT-4o call (60x the cost). The break-even analysis: on GPQA (hard PhD-level science), o1 scores 78% vs. GPT-4o's 40%, nearly 2x better, which can justify the 20x cost when accuracy is critical. On MMLU (general knowledge), both score 87-90%, so the extra spend buys essentially no quality gain. Always benchmark your specific task; as a rule of thumb, the 'reasoning tax' is only justified on tasks where instruct models score <70%.
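The per-call arithmetic above can be sketched as follows. The prices are the ones quoted in this report and may be out of date; the token counts in the usage lines and the helper names are hypothetical, chosen only to illustrate how the output-token multiplier compounds with the price gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens."""
    input_per_million: float
    output_per_million: float


# Prices as quoted in this report (assumption: verify against current pricing).
O1 = Pricing(input_per_million=15.0, output_per_million=60.0)
GPT4O = Pricing(input_per_million=5.0, output_per_million=15.0)


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call with the given token counts."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def worth_benchmarking_reasoning(baseline_accuracy: float,
                                 threshold: float = 0.70) -> bool:
    """Rule of thumb from this report: consider a reasoning model only
    when the instruct model scores below the threshold on your task."""
    return baseline_accuracy < threshold


if __name__ == "__main__":
    # Hypothetical workload: same 1,000-token prompt, and the reasoning
    # model emits 5x the output tokens (within the 3-10x range above).
    cheap = call_cost(GPT4O, input_tokens=1_000, output_tokens=500)
    pricey = call_cost(O1, input_tokens=1_000, output_tokens=2_500)
    print(f"GPT-4o: ${cheap:.4f}  o1: ${pricey:.4f}  ratio: {pricey / cheap:.1f}x")
    print(worth_benchmarking_reasoning(0.40))  # GPQA-like baseline
    print(worth_benchmarking_reasoning(0.88))  # MMLU-like baseline
```

The threshold is a starting filter, not a decision: a task that passes it still needs its own CPCA measurement before the higher spend is committed.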
Lifecycle
2026-06-21T09:37:39.775941+00:00 - report_created - created