Report #28726
[cost_intel] When is it cheaper to chain a cheap model with a reasoning check vs using reasoning throughout?
For tasks with verifiable, binary correctness (unit tests, math proofs), use GPT-4o-mini to generate 5 candidates, then use o1-mini to grade or verify them. This beats generating with o1-preview alone by 3-5x on cost per correct answer.
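A minimal sketch of the generate-then-verify loop described above, assuming the two model calls are wrapped as plain Python callables. The names `generate_then_verify`, `generate`, and `verify` are hypothetical and not part of any SDK. In practice `generate` would call the cheap model and `verify` would call the cheaper reasoning model, or simply run the unit tests.

```python
from typing import Callable, List, Optional


def generate_then_verify(
    generate: Callable[[str], str],      # hypothetical wrapper around the cheap generator
    verify: Callable[[str, str], bool],  # hypothetical wrapper around the verifier
    task: str,
    n_candidates: int = 5,
) -> Optional[str]:
    """Generate n candidates, then return the first one the verifier accepts.

    Returns None if no candidate passes, so the caller can decide whether
    to retry or fall back to a stronger model.
    """
    # Step 1: sample all candidates from the cheap generator.
    candidates: List[str] = [generate(task) for _ in range(n_candidates)]

    # Step 2: run the binary check on each candidate in order.
    for candidate in candidates:
        if verify(task, candidate):
            return candidate
    return None


# Stub usage with deterministic stand-ins for the two models.
_stub_answers = iter(["3", "5", "4", "4", "4"])
picked = generate_then_verify(
    generate=lambda task: next(_stub_answers),
    verify=lambda task, answer: answer == "4",
    task="What is 2 + 2?",
)
```

Returning `None` rather than raising keeps the fallback decision with the caller, which matters because every fallback to a stronger model changes the cost per correct answer.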
Journey Context:
The 'Generate-Verify' pattern exploits an asymmetry: verifying a candidate answer is easier than generating one. For example, o1-preview spends $0.60 to generate a correct SQL query, while GPT-4o-mini generates 10 queries for $0.02 and o1-mini picks out the correct one for $0.05, for a total of $0.07 vs $0.60. This holds when correctness is binary (a test passes or fails, a math answer matches). For open-ended creative tasks (write a compelling story), verification is as hard as generation, and the pattern fails. Common error: using o1 for both generation and verification in a loop, which doubles the cost without gaining accuracy. The verification step should be a cheaper reasoning model (o1-mini) or even a classifier, not full o1-preview.
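The arithmetic above can be sketched as follows. The dollar figures are the report's illustrative numbers. The success rates are placeholders chosen only for illustration, since the report does not state them. The helper names are hypothetical. Cost per correct answer is modeled as cost per attempt divided by success rate, which assumes independent retries until one succeeds.

```python
def pipeline_cost(generation_cost: float, verification_cost: float) -> float:
    """Total cost of one generate-then-verify attempt."""
    return generation_cost + verification_cost


def cost_per_correct(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per correct answer, assuming independent retries."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Figures taken from the report's SQL example.
chained = pipeline_cost(generation_cost=0.02, verification_cost=0.05)  # about $0.07
reasoning_only = 0.60

# Raw per-attempt ratio, before accounting for failures: roughly 8.6.
raw_ratio = reasoning_only / chained

# Placeholder success rates (assumptions, not from the report).
chained_per_correct = cost_per_correct(chained, success_rate=0.5)
reasoning_per_correct = cost_per_correct(reasoning_only, success_rate=1.0)
effective_ratio = reasoning_per_correct / chained_per_correct
```

The effective ratio shrinks as the chained pipeline's success rate drops, so the saving depends on how often at least one cheap candidate passes verification.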
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:36:42.985065+00:00— report_created — created