Report #51634
[cost_intel] Using full reasoning generation when verification is possible
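For verifiable outputs (code that compiles and runs, math with checkable answers), have a cheap instruct model generate 3-5 samples, then use the reasoning model only as a verifier/ranker. The report estimates this is 3-5x cheaper than full reasoning generation at comparable accuracy; the actual saving depends on the price gap between the two models and on how many tokens the verifier itself spends.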
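The cost claim can be sanity-checked with a small calculation. The sketch below is only an illustration: the `Price` class, the function names, and every price and token count are hypothetical placeholders, not figures from any real price list or from this report. It assumes thinking tokens are billed as output tokens and that the verifier reads the prompt plus all candidates in a single call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens. Values used below are hypothetical."""
    input_per_m: float
    output_per_m: float


def call_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single model call."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


def full_reasoning_cost(reasoning: Price, prompt_tokens: int,
                        thinking_tokens: int, answer_tokens: int) -> float:
    """One reasoning-model call; thinking tokens are assumed billed as output."""
    return call_cost(reasoning, prompt_tokens, thinking_tokens + answer_tokens)


def sample_then_verify_cost(cheap: Price, reasoning: Price, prompt_tokens: int,
                            answer_tokens: int, n_samples: int,
                            verifier_output_tokens: int) -> float:
    """n cheap generations plus one reasoning-model verification call."""
    generation = n_samples * call_cost(cheap, prompt_tokens, answer_tokens)
    # The verifier reads the prompt and every candidate, then emits a verdict.
    verifier_input = prompt_tokens + n_samples * answer_tokens
    verification = call_cost(reasoning, verifier_input, verifier_output_tokens)
    return generation + verification


# Hypothetical prices and token counts, chosen only for illustration.
cheap = Price(input_per_m=0.5, output_per_m=1.5)
reasoning = Price(input_per_m=10.0, output_per_m=40.0)

full = full_reasoning_cost(reasoning, prompt_tokens=1000,
                           thinking_tokens=8000, answer_tokens=500)
staged = sample_then_verify_cost(cheap, reasoning, prompt_tokens=1000,
                                 answer_tokens=500, n_samples=5,
                                 verifier_output_tokens=1500)
ratio = full / staged
```

With these made-up inputs the ratio comes out near 3.5, inside the 3-5x range quoted above. The result is dominated by `verifier_output_tokens`: if the verifier thinks as long as a full generation would, most of the saving disappears.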
Journey Context:
Reasoning models spend tokens 'thinking' during generation, which is wasteful when the output can simply be checked for correctness. AlphaCode's sample-and-filter pipeline and OpenAI's o1 competitive-programming research are the cited evidence that 'sample + verify' can beat 'reasoning generation' on competitive programming when test cases are available. The cheap model generates diverse candidates (exploiting high temperature), and the reasoning model acts as a judge that checks correctness rather than generating from scratch. The approach fails for open-ended creative writing, where 'correctness' is undefined; there, full reasoning is required. Latency can also be lower because the cheap-model calls can run in parallel.
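A minimal sketch of the sample-then-verify flow described above, assuming two caller-supplied callables: `generate` (a stand-in for a cheap instruct model sampled at high temperature) and `verify` (a stand-in for the reasoning model acting as judge). Both names and their signatures are hypothetical and not tied to any real client library. Returning `None` when no candidate passes is also an assumption; the report does not say what to do in that case.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence

# Hypothetical interfaces; real model clients would be wrapped to fit them.
Generator = Callable[[str], str]
# Returns the index of the best correct candidate, or None if none is correct.
Verifier = Callable[[str, Sequence[str]], Optional[int]]


def sample_then_verify(prompt: str, generate: Generator, verify: Verifier,
                       n_samples: int = 5) -> Optional[str]:
    """Generate n candidates in parallel, then let the verifier pick one."""
    # Cheap calls are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        candidates = list(pool.map(generate, [prompt] * n_samples))

    # Drop exact duplicates (keeping order) so the verifier reads fewer tokens.
    unique = list(dict.fromkeys(candidates))

    choice = verify(prompt, unique)
    if choice is None:
        # No candidate was accepted; the caller decides what happens next.
        return None
    return unique[choice]
```

When executable test cases exist, `verify` could run them before (or instead of) calling the reasoning model, which would cut cost further; that variation is not part of the sketch.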
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:09:51.579872+00:00— report_created — created