Report #92053
[cost\_intel] Using expensive reasoning models for multiple-choice and classification tasks where fine-tuned small models excel
For multiple-choice, entity extraction, and classification tasks with abundant training data, fine-tune GPT-4o or GPT-3.5-turbo instead of using o1; reserve o1 for open-ended generation with hidden test criteria.
Journey Context:
On MMLU \(multiple choice\), GPT-4o achieves 87% accuracy while o1-preview reaches 92%: a gain of 5 percentage points for 15x the cost. In batch processing, cost-per-correct-answer comes to $0.002 for fine-tuned GPT-3.5 vs $0.50 for o1. However, on open-ended code generation \(HumanEval\), o1-preview achieves 92% pass@1 vs GPT-4o's 67%, which makes cost-per-correct-answer lower for o1 once retry loops are accounted for. The breakpoint is task verifiability: when correctness is easy to check \(classification\), cheap models win; when verification requires execution \(code\), reasoning models win.
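The cost-per-correct-answer comparison above can be sketched as follows. This is a minimal illustration, not the method used to produce the report's figures. It assumes independent retries with a fixed success probability (so the expected number of attempts is 1 / accuracy) and a fixed per-attempt verification cost. The function names, the `verify_cost` parameter, and the unit costs in the usage note are hypothetical.

```python
def cost_per_correct(call_cost: float, accuracy: float, verify_cost: float = 0.0) -> float:
    """Expected spend to obtain one correct answer when failures are retried.

    Assumption: attempts are independent with success probability `accuracy`,
    so the expected number of attempts is 1 / accuracy. Each attempt pays for
    the model call plus checking the result (`verify_cost`, hypothetical).
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return (call_cost + verify_cost) / accuracy


def breakeven_verify_cost(cheap_cost: float, cheap_acc: float,
                          strong_cost: float, strong_acc: float) -> float:
    """Per-attempt verification cost above which the stronger model is cheaper.

    Solves (strong_cost + v) / strong_acc == (cheap_cost + v) / cheap_acc for v.
    A negative result means the stronger model is already cheaper per correct
    answer even when verification is free.
    """
    if strong_acc <= cheap_acc:
        raise ValueError("strong_acc must exceed cheap_acc")
    return (cheap_acc * strong_cost - strong_acc * cheap_cost) / (strong_acc - cheap_acc)


if __name__ == "__main__":
    # Hypothetical unit costs: cheap model = 1.0 per call, reasoning model = 15.0
    # (the 15x ratio from the report). Accuracies are the report's figures.
    for task, cheap_acc, strong_acc in [("MMLU", 0.87, 0.92), ("HumanEval", 0.67, 0.92)]:
        v = breakeven_verify_cost(1.0, cheap_acc, 15.0, strong_acc)
        print(f"{task}: break-even verification cost per attempt = {v:.2f} units")
```

Under these assumptions, the reasoning model comes out ahead only once checking each attempt costs more than the break-even value. With the report's HumanEval figures and a 15x price ratio, that is roughly 36.5 times the cheap model's per-call cost. The MMLU figures give a far higher break-even. This is consistent with the verifiability breakpoint: the more expensive each check, the more a higher first-pass success rate is worth.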
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:06:13.304943+00:00— report_created — created