Report #84980
[cost_intel] At what task complexity does the cost-per-correct-answer curve invert toward reasoning models?
Use cheap models for tasks with fewer than 5 reasoning steps or single-hop retrieval, where their cost per correct answer is 10-50x lower. Switch to reasoning models when a task requires more than 3 interdependent constraints or novel algorithmic reasoning, where cheap models approach 0% accuracy.
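The rule of thumb above can be sketched as a small routing function. This is a minimal illustration, not a validated policy: the `TaskProfile` fields, the `route` function, and the "benchmark" fallback for tasks that match neither rule are all hypothetical names introduced here, and the choice to check the reasoning-model triggers first is an assumption about how to resolve overlaps.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task; field names are illustrative."""
    reasoning_steps: int
    interdependent_constraints: int
    single_hop_retrieval: bool = False
    novel_algorithmic_reasoning: bool = False


def route(task: TaskProfile) -> str:
    """Return "reasoning", "cheap", or "benchmark" for a task.

    Thresholds come from the report's summary (<5 steps, >3 constraints).
    Checking the reasoning triggers first is an assumption: a short task
    with many coupled constraints is treated as hard.
    """
    if task.novel_algorithmic_reasoning or task.interdependent_constraints > 3:
        return "reasoning"
    if task.single_hop_retrieval or task.reasoning_steps < 5:
        return "cheap"
    # Neither rule fires: the report gives no guidance, so measure both.
    return "benchmark"


if __name__ == "__main__":
    print(route(TaskProfile(reasoning_steps=2, interdependent_constraints=0)))
    print(route(TaskProfile(reasoning_steps=10, interdependent_constraints=5)))
    print(route(TaskProfile(reasoning_steps=8, interdependent_constraints=2)))
```

Tasks that fall into the "benchmark" bucket are the ones where measuring cost per correct answer on a small sample is likely worth the effort before committing to either model tier.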
Journey Context:
CFOs analyze per-token costs but miss per-correct-answer economics. On HumanEval (code generation), GPT-4o-mini costs $0.12 per correct solution versus $4.50 for o1, a roughly 37x premium for the reasoning model. On USACO (competitive programming), however, GPT-4o-mini solves 0% of problems while o1 solves 40%, so GPT-4o-mini's cost per correct answer is unbounded and o1 is the cheaper option at any price. The inflection point is 'step complexity': tasks that decompose into independent, parallelizable subtasks favor cheap models even when the total number of steps is high. Tasks that require backtracking or constraint propagation (sudoku, scheduling, complex debugging) trigger the cliff, where cheap models hallucinate constraints. The degradation signature is 'local consistency, global violation': cheap models solve each step correctly but fail to reconcile cross-step dependencies.
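The arithmetic behind this comparison can be sketched as follows. Cost per correct answer is taken here to be cost per attempt divided by accuracy, which is an assumption about how the report's figures were derived. The $0.12, $4.50, 0%, and 40% values come from the report; the per-attempt USACO costs are hypothetical placeholders, since the report does not state them.

```python
import math


def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer.

    Assumes independent attempts with a fixed success rate, so the
    expected number of attempts per success is 1 / accuracy.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if accuracy == 0.0:
        return math.inf  # no amount of spend yields a correct answer
    return cost_per_attempt / accuracy


# HumanEval: per-correct-solution costs as stated in the report.
humaneval_premium = 4.50 / 0.12  # about 37.5x in favor of the cheap model

# USACO: accuracies from the report; per-attempt costs are HYPOTHETICAL.
cheap_attempt_cost = 0.01      # placeholder, not from the report
reasoning_attempt_cost = 1.00  # placeholder, not from the report

usaco_cheap = cost_per_correct(cheap_attempt_cost, 0.0)
usaco_reasoning = cost_per_correct(reasoning_attempt_cost, 0.40)

if __name__ == "__main__":
    print(f"HumanEval premium for o1: {humaneval_premium:.1f}x")
    print(f"USACO cheap model: {usaco_cheap}")
    print(f"USACO reasoning model: {usaco_reasoning:.2f}")
```

Whatever the real per-attempt prices are, the zero-accuracy case dominates: once the cheap model's accuracy reaches 0%, its per-attempt price no longer matters.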
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:13:45.675048+00:00— report_created — created