Report #70222
[cost_intel] When is it cheaper to use multiple samples from a cheap model vs one reasoning model?
Use Best-of-5 or Best-of-9 sampling from GPT-4o with a lightweight judge (GPT-4o-mini) when the task has verifiable outputs (code, math with a checker, structured data). The crossover is at ~60% base accuracy: if the cheap model gets >60% right, BoN beats o1 on cost per correct answer. For open-ended writing or ambiguous reasoning, stick with a single o1 call.
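A minimal sketch of the Best-of-N loop described above, assuming the task has a programmatic pass/fail check. The names `best_of_n`, `generate`, and `verify` are hypothetical: `generate` stands in for one call to the cheap model and `verify` for the checker (or a wrapper around the judge model). Stopping at the first passing candidate is a design choice made here, not something the report specifies.

```python
from typing import Callable, Optional


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],   # hypothetical: one cheap-model call
    verify: Callable[[str], bool],    # hypothetical: checker or judge wrapper
    n: int = 5,
) -> Optional[str]:
    """Return the first of up to n candidates that passes verify, else None."""
    for _ in range(n):
        candidate = generate(prompt)
        if verify(candidate):
            # Early exit: later samples are never paid for.
            return candidate
    return None


if __name__ == "__main__":
    # Stub generator and checker, for illustration only.
    answers = iter(["41", "42", "43"])
    result = best_of_n("What is 6 * 7?", lambda p: next(answers), lambda c: c == "42")
    print(result)
```

When `verify` is a real checker, early exit lowers the average number of samples paid for. A `None` result is the natural point to fall back to a single reasoning-model call.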
Journey Context:
OpenAI research shows verification is easier than generation. At a 10-20x cost ratio (o1 vs 4o), sampling 4o 5 times is cheaper than calling o1 once, and it wins on cost per correct answer if 4o has >60% accuracy. If 4o has 70% accuracy on coding questions, 5 independent samples give about 99.8% coverage (1 - 0.3^5), i.e. the chance that at least one sample is correct; correlated samples will land lower. A small judge model (4o-mini at $0.0006) picks the best. Total cost: $0.028 vs $0.06 for o1, with equal or better accuracy as long as the judge reliably picks a correct sample. However, for creative writing where 'correct' is undefined, the judge fails and o1's coherence wins.
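The coverage and cost-per-correct arithmetic can be checked with a short script. This is a sketch under stated assumptions: samples are treated as independent, the judge is assumed to always pick a correct sample when one exists, and `O1_ACCURACY` is a placeholder rather than a measured value. The dollar figures are the ones quoted in this report; the function names are illustrative.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def cost_per_correct(total_cost: float, success_rate: float) -> float:
    """Expected spend per correct answer at a given success rate."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return total_cost / success_rate


if __name__ == "__main__":
    bon_coverage = coverage(0.70, 5)
    print(f"Best-of-5 coverage at 70% base accuracy: {bon_coverage:.4f}")  # 0.9976

    BON_TOTAL_COST = 0.028  # from the report: 5 samples plus judge
    O1_COST = 0.06          # from the report: one o1 call
    O1_ACCURACY = 0.90      # ASSUMPTION: placeholder, not a measured value

    print(f"BoN cost per correct: ${cost_per_correct(BON_TOTAL_COST, bon_coverage):.4f}")
    print(f"o1 cost per correct:  ${cost_per_correct(O1_COST, O1_ACCURACY):.4f}")
```

Coverage is an upper bound on the accuracy of the judged pipeline, so rerun the comparison with your own measured base accuracy and judge hit rate before relying on the crossover.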
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:27:07.429545+00:00— report_created — created