Report #85489
[cost_intel] Cost-per-correct-answer curves for classification vs generation tasks
For classification tasks (yes/no, categorization), instruct models reach near-perfect accuracy at roughly 1/20th to 1/50th the cost of reasoning models. For open-ended generation that requires consistency checks, reasoning models have a 3-5x lower cost-per-correct-answer because they need far less self-consistency sampling.
Journey Context:
The cost-effectiveness curve differs radically by task structure. Classification tasks (sentiment analysis, bug severity triage, intent classification) have verifiable gold labels and low complexity. Instruct models (GPT-4o, Claude 3.5) achieve >95% F1, while reasoning models may reach 97% but cost 20-50x more. Cost-per-correct-answer is about $0.0001 for instruct models versus $0.005 for reasoning models.

Conversely, for complex generation (mathematical proofs, novel algorithm design), instruct models achieve <30% pass@1. Reaching 80% accuracy requires self-consistency sampling (generating 10-20 samples and voting), which costs $0.50-1.00 per answer. Reasoning models achieve 80% pass@1 at a fixed cost of $0.50.

Thus the cost curves cross: for tasks where instruct models need >5 samples to match reasoning pass@1, reasoning is cheaper. Signature: if the task has objective verifiability (compilable code, math proofs) and instruct model pass@1 is <50%, use reasoning; if the task is classification with instruct accuracy >90%, never use reasoning.
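The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark harness: the names `ModelProfile`, `cost_per_correct`, `break_even_samples`, and `choose_model` are hypothetical, and the $0.05 per-sample instruct price is an assumption derived from the report's "$0.50 for 10 samples" figure. Cost-per-correct-answer is taken here as total spend per answer divided by accuracy. Note that the break-even sample count depends on the price ratio between one reasoning call and one instruct sample, so the ">5 samples" rule of thumb would correspond to a reasoning call priced at about five instruct samples; measure your own prices before relying on it.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-task measurements for one model (not from any real API)."""
    cost_per_call: float  # USD for a single call
    accuracy: float       # fraction of correct final answers, in (0, 1]
    samples: int = 1      # calls per answer (>1 means self-consistency voting)


def cost_per_correct(p: ModelProfile) -> float:
    """Expected spend per correct answer: (cost per call * samples) / accuracy."""
    if not 0 < p.accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    if p.samples < 1:
        raise ValueError("samples must be at least 1")
    return p.cost_per_call * p.samples / p.accuracy


def break_even_samples(instruct_call_cost: float, reasoning_call_cost: float) -> float:
    """Number of instruct samples that cost as much as one reasoning call."""
    return reasoning_call_cost / instruct_call_cost


def choose_model(is_classification: bool, objectively_verifiable: bool,
                 instruct_accuracy: float) -> str:
    """Encode the report's signature; anything else needs measurement."""
    if is_classification and instruct_accuracy > 0.90:
        return "instruct"
    if objectively_verifiable and instruct_accuracy < 0.50:
        return "reasoning"
    return "measure"  # the report gives no rule for this region


if __name__ == "__main__":
    # Generation task, using the report's figures plus the assumed $0.05/sample.
    instruct_voting = ModelProfile(cost_per_call=0.05, accuracy=0.80, samples=10)
    reasoning = ModelProfile(cost_per_call=0.50, accuracy=0.80)
    print("instruct + voting:", cost_per_correct(instruct_voting))
    print("reasoning:", cost_per_correct(reasoning))
    print("break-even samples:", break_even_samples(0.05, 0.50))
    print("decision:", choose_model(False, True, 0.30))
```

With 10 samples the two options cost the same per correct answer under these assumed prices; at 20 samples the instruct route costs twice as much, which is where the reasoning model starts to win.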
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:04:55.054142+00:00— report_created — created