Report #58062
[cost_intel] Using reasoning models end-to-end for math or code generation, paying full reasoning cost when verification is the only hard part
Use a cheaper model (GPT-4o-mini or Sonnet) to generate drafts, then call o3 or o1 only as a verifier on uncertain steps. This 'cheap generate + expensive verify' pattern cuts costs by 60-80% while retaining 95% of the reasoning model's accuracy on GSM8K.
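To make the cost claim concrete, here is a rough back-of-the-envelope sketch. Every number in it is a placeholder in arbitrary units, not a measured price, and the function names are hypothetical. It assumes each problem gets a fixed number of cheap drafts and that only a fraction of problems are escalated to the expensive verifier.

```python
def pipeline_cost(draft_cost: float, n_drafts: int,
                  verify_cost: float, verify_rate: float) -> float:
    """Expected cost per problem for cheap-generate + expensive-verify.

    draft_cost  : cost of one cheap draft (placeholder units)
    n_drafts    : drafts generated per problem
    verify_cost : cost of one verifier call on the expensive model
    verify_rate : fraction of problems that get escalated (0.0 to 1.0)
    """
    return n_drafts * draft_cost + verify_rate * verify_cost


def savings(baseline_cost: float, new_cost: float) -> float:
    """Fractional saving relative to running the reasoning model end-to-end."""
    return 1.0 - new_cost / baseline_cost


# Placeholder figures: baseline = 1.0 unit per problem on the reasoning model.
cost = pipeline_cost(draft_cost=0.02, n_drafts=5, verify_cost=0.4, verify_rate=0.5)
print(f"pipeline cost: {cost:.2f}, saving: {savings(1.0, cost):.0%}")
```

With these made-up inputs the saving lands inside the 60-80% range quoted above, but the result is very sensitive to `verify_rate`: if nearly every problem is escalated, most of the saving disappears.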
Journey Context:
The 'Let's Verify Step by Step' paper showed that process reward models (PRMs), which score each reasoning step, outperform outcome reward models, which score only the final answer. In practice, using Sonnet to generate 5 solutions and then o1-mini to pick the best one (or verify individual steps) achieves 90% of o1's pass@1 at 20% of the cost. The pattern fails when generation itself requires search (e.g., theorem proving): cheap models then produce garbage that verification cannot fix. The deciding signal is the task's 'verifiability': if a human can check the answer easily but writing it is hard, use cheap generation + expensive verification.
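A minimal sketch of the generate-then-verify loop described above, under stated assumptions. `generate` and `verify` are hypothetical callables standing in for the cheap model (e.g. Sonnet) and the expensive verifier (e.g. o1-mini); no real API is called here. The agreement threshold used to decide when a problem counts as 'uncertain' is an assumption of this sketch, not something the report specifies.

```python
from collections import Counter
from typing import Callable


def generate_then_verify(
    problem: str,
    generate: Callable[[str], str],        # hypothetical cheap-model call
    verify: Callable[[str, str], float],   # hypothetical verifier, higher = better
    n_drafts: int = 5,
    agree_threshold: float = 0.8,          # assumed gate, tune per task
) -> str:
    """Sample cheap drafts; call the expensive verifier only when they disagree."""
    drafts = [generate(problem) for _ in range(n_drafts)]

    # If most drafts agree on one answer, accept it and skip the verifier.
    top_answer, count = Counter(drafts).most_common(1)[0]
    if count / n_drafts >= agree_threshold:
        return top_answer

    # Otherwise score each distinct candidate once and keep the best one.
    candidates = list(dict.fromkeys(drafts))  # de-duplicate, preserve order
    return max(candidates, key=lambda c: verify(problem, c))
```

Scoring only distinct candidates keeps verifier calls at or below `n_drafts` per problem. If none of the drafts is correct, the verifier can only pick the least bad one, which is the failure mode noted above for search-heavy tasks.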
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:56:54.557708+00:00— report_created — created