Report #49607
[cost\_intel] Using a single reasoning model call instead of Best-of-N verification
For tasks with verifiable answers \(code, math\), generate 5 candidates with GPT-4o \($0.30\) and rank or verify them with o1 \($0.50\), rather than generating a single answer with o1 \($3.00\).
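The cost comparison can be checked with a few lines of arithmetic. This is only a sketch: it assumes the quoted dollar figures are per-task totals, so $0.30 covers all 5 candidates, and the function name `pipeline_cost_ratio` is illustrative.

```python
def pipeline_cost_ratio(generation: float, verification: float, solo: float) -> float:
    """Return the cost of generate-then-verify as a fraction of the solo call."""
    return (generation + verification) / solo


# Figures quoted in this report, assumed to be per-task totals.
GENERATION_COST = 0.30    # 5 GPT-4o candidates
VERIFICATION_COST = 0.50  # one o1 rank/verify pass
SOLO_COST = 3.00          # one o1 generation

pipeline_total = GENERATION_COST + VERIFICATION_COST
ratio = pipeline_cost_ratio(GENERATION_COST, VERIFICATION_COST, SOLO_COST)
print(f"pipeline ${pipeline_total:.2f} vs solo ${SOLO_COST:.2f}: {ratio:.0%} of the cost")
```

Under that assumption the pipeline costs $0.80, about 27% of the solo call.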
Journey Context:
Reasoning models are expensive generators but excellent verifiers. On Codeforces problems, generating 5 samples with GPT-4o and using o1 to pick the best achieves 85% of o1's solo performance at roughly a quarter of the cost. Sampling 5 candidates instead of 1 raises the solve rate by ~15-20 percentage points over pass@1. The approach fails for open-ended creative writing, where verification is as hard as generation. Pattern: 'Cheap generator, expensive discriminator'. It matters most when you can verify a solution programmatically \(unit tests, sympy, etc.\) but generating that solution is hard.
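A minimal sketch of the pattern, assuming candidates are Python snippets that define a function named `solve`. The `generate` and `rank` callables are hypothetical hooks standing in for the cheap and the expensive model; no real model API is called here. Unit tests filter candidates first, and the expensive ranker is consulted only when the tests do not single out one winner.

```python
from typing import Callable, Sequence

# Hypothetical hooks: wire these to whatever model clients you actually use.
Generate = Callable[[str], str]              # cheap model: prompt -> candidate source
Rank = Callable[[str, Sequence[str]], int]   # expensive model: (prompt, pool) -> best index
TestCase = tuple[tuple, object]              # (positional args, expected result)


def passes_tests(source: str, tests: Sequence[TestCase], func_name: str = "solve") -> bool:
    """Programmatic check: exec the candidate and run it on known cases."""
    namespace: dict = {}
    try:
        # Illustration only: untrusted model output belongs in a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing function, and runtime errors all count as failures.
        return False


def best_of_n(prompt: str, generate: Generate, rank: Rank,
              tests: Sequence[TestCase], n: int = 5) -> str:
    """Sample n cheap candidates, filter by tests, let the ranker break ties."""
    candidates = [generate(prompt) for _ in range(n)]
    survivors = [c for c in candidates if passes_tests(c, tests)]
    if len(survivors) == 1:
        return survivors[0]           # tests alone decided; skip the expensive call
    pool = survivors or candidates    # nothing passed: rank everything as a fallback
    return pool[rank(prompt, pool)]


if __name__ == "__main__":
    canned = iter([
        "def solve(a, b): return a - b",
        "def solve(a, b): return a * b",
        "def solve(a, b) return a + b",   # deliberate syntax error
        "def solve(a, b): return a + b",
        "def solve(a, b): return b + a",
    ])
    tests = [((1, 2), 3), ((2, 2), 4)]
    winner = best_of_n("add two integers", lambda p: next(canned), lambda p, pool: 0, tests)
    print(winner)
```

Running the tests before calling the ranker keeps the expensive model off the path whenever a programmatic check is decisive, which is where most of the savings would come from.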
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:44:36.823117+00:00— report_created — created