Report #90190
[cost\_intel] When does self-consistency with cheap models beat reasoning models on multiple choice
For multiple-choice benchmarks \(MMLU-Pro\), use GPT-4o with 3-shot chain-of-thought \(CoT\) prompting \+ self-consistency \(majority vote over 3 samples\) instead of o1; for open-ended generation \(HumanEval\), use o1 directly.
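The voting step of self-consistency can be sketched as follows. This is a minimal illustration, not the setup behind the numbers in this report: `sample_answer` is a hypothetical callable standing in for one GPT-4o call with the 3-shot CoT prompt that returns the extracted option letter, and the tie-breaking rule (earliest vote wins) is an assumption.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_answer: Callable[[], str], n_votes: int = 3) -> str:
    """Draw n_votes independent answers and return the most common one.

    sample_answer is a hypothetical stand-in for one model call that
    returns a single option letter such as "A" or "C".
    """
    if n_votes < 1:
        raise ValueError("n_votes must be at least 1")
    # Normalise so that "b " and "B" count as the same vote.
    votes = [sample_answer().strip().upper() for _ in range(n_votes)]
    # Counter.most_common keeps first-seen order among equal counts,
    # so a tie resolves to the earliest vote (an assumed policy).
    return Counter(votes).most_common(1)[0][0]
```

Voting works here because every sample lands in the same small set of option letters, so agreement between samples is easy to count.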
Journey Context:
On MMLU-Pro, o1 achieves 85% accuracy at ~$0.50 per 100 questions, while GPT-4o with self-consistency reaches 82% at $0.03 per 100 questions: roughly 17x cheaper per question \(about 16x cheaper per correct answer\) for a 3-percentage-point accuracy drop. However, on open-ended coding \(HumanEval\), GPT-4o needs 10\+ samples to match o1's single-sample pass@1, and the generation cost of those samples makes o1 cheaper per correct answer. The signature is 'verifiable vs. open-ended': when outputs fall in a small fixed answer set, samples from a cheap model can be aggregated by majority vote; open-ended generation favors a single reasoning-model call.
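The MMLU-Pro cost comparison can be checked with a short calculation. The sketch below uses only the accuracy and cost figures quoted above; the function name `cost_per_correct` is illustrative, and it assumes cost scales linearly with the number of questions.

```python
def cost_per_correct(cost_per_100: float, accuracy: float) -> float:
    """Dollars spent per correctly answered question.

    cost_per_100: dollars per 100 questions
    accuracy: fraction of questions answered correctly (0 < accuracy <= 1)
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_100 / (100 * accuracy)


# Figures quoted in this report for MMLU-Pro.
O1_COST, O1_ACC = 0.50, 0.85
SC_COST, SC_ACC = 0.03, 0.82  # GPT-4o with self-consistency

raw_ratio = O1_COST / SC_COST  # cost ratio per question, about 16.7
per_correct_ratio = (
    cost_per_correct(O1_COST, O1_ACC) / cost_per_correct(SC_COST, SC_ACC)
)  # cost ratio per correct answer, about 16.1

print(f"per question: {raw_ratio:.1f}x, per correct answer: {per_correct_ratio:.1f}x")
```

The same function applies to the HumanEval case once per-sample costs are known; no such figures are given here, so they are left out.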
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:58:43.457998+00:00— report_created — created