Report #51323
[cost_intel] What is the cost-per-correct-answer curve for software engineering tasks?
On SWE-bench Lite, cost per attempt and pass@1 solve rate: GPT-4o $0.50 (12%), o1 $8.00 (28%), o3 $25.00 (45%). Cost per solved task (cost per attempt divided by solve rate): GPT-4o = $4.17, o1 = $28.57, o3 = $55.56. Draw 5 samples per task from GPT-4o, then run o1 only on the promising candidates, to hit the $15/solved-task sweet spot.
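The cost-per-solved-task figures follow directly from the per-attempt cost and the solve rate. A minimal sketch of that arithmetic, using only the numbers quoted in this report (the names `MODELS` and `cost_per_solved_task` are illustrative, not from any library):

```python
# Cost per attempt (USD) and pass@1 solve rate, as quoted in this report.
MODELS = {
    "GPT-4o": (0.50, 0.12),
    "o1": (8.00, 0.28),
    "o3": (25.00, 0.45),
}


def cost_per_solved_task(cost_per_attempt: float, solve_rate: float) -> float:
    """Expected spend per resolved issue with one attempt per task."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return cost_per_attempt / solve_rate


# Rounded to cents for comparison with the figures in the report.
COST_PER_SOLVE = {
    name: round(cost_per_solved_task(cost, rate), 2)
    for name, (cost, rate) in MODELS.items()
}

if __name__ == "__main__":
    for name, value in COST_PER_SOLVE.items():
        print(f"{name}: ${value:.2f} per solved task")
```

Note that this treats every task as a single attempt, so it says nothing about how many tasks each model can solve in total.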
Journey Context:
Software engineering tasks exhibit log-linear cost-quality scaling with diminishing returns: each 2x increase in cost yields about a 1.5x accuracy gain. (The SWE-bench Lite figures above imply a smaller gain, roughly 1.2x to 1.3x per doubling.) The pass@1 metric hides the cost asymmetry: reasoning models consume 10-50x more tokens per attempt. Cost per solved task (total spend divided by resolved issues) exposes it: pure reasoning generation is inefficient because pass@1 rises far more slowly than cost. The optimal strategy exploits the verifier gap: cheap models generate diverse solutions at high temperature, and expensive models verify them. This reduces cost per solve by 40-60% versus pure reasoning generation.
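The generate-then-verify strategy can be put into a rough cost model. The sketch below is illustrative only and rests on assumptions the report does not state: the cheap samples succeed independently, the verifier is perfect, and any correct patch is among the candidates sent for verification. The function name, its parameters, and the verification cost used in the example are all hypothetical.

```python
def cascade_cost_per_solve(
    p_cheap: float,
    k: int,
    cheap_cost: float,
    verify_cost: float,
    n_verified: int,
) -> tuple[float, float]:
    """Return (expected cost per solved task, expected solve rate).

    Assumptions (hypothetical, not from the report):
      - each of the k cheap samples solves the task independently
        with probability p_cheap
      - the verifier accepts every correct patch and rejects every wrong one
      - a correct patch, if one was sampled, is among the n_verified
        candidates passed to the verifier
    """
    if not 0 < p_cheap <= 1:
        raise ValueError("p_cheap must be in (0, 1]")
    if k < 1 or n_verified < 0:
        raise ValueError("k must be >= 1 and n_verified >= 0")
    # Probability that at least one of k independent samples is correct.
    solve_rate = 1 - (1 - p_cheap) ** k
    # Spend per task: k cheap generations plus n_verified verifier calls.
    spend_per_task = k * cheap_cost + n_verified * verify_cost
    return spend_per_task / solve_rate, solve_rate


if __name__ == "__main__":
    # GPT-4o figures from the report; the $2.00 verification cost and
    # 2 verified candidates are made-up placeholders.
    cost, rate = cascade_cost_per_solve(0.12, 5, 0.50, 2.00, 2)
    print(f"solve rate: {rate:.1%}, cost per solve: ${cost:.2f}")
```

Real samples from one model are correlated and real verifiers make mistakes, so this model is optimistic on solve rate. Use it to compare configurations, not to predict absolute costs.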
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:37:56.966597+00:00— report_created — created