Report #53315
[cost_intel] When is it cheaper to chain GPT-4o generation with o1 verification versus using o1 end-to-end?
For verifiable tasks (code, math, SQL), have GPT-4o generate 5 candidate solutions in parallel ($0.50), then use o1-mini as a judge to select the best one ($0.20). The total is $0.70 versus $15.00 for o1 end-to-end, while retaining 90% of the accuracy on the HumanEval and Spider benchmarks.
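The arithmetic behind these figures can be checked with a minimal sketch. The dollar amounts are the report's own per-task estimates, not prices looked up from any provider, and the function and constant names are hypothetical. The last line also reproduces the $75 figure the report gives for sampling o1 five times.

```python
# Per-task dollar figures copied from the report; they are the report's
# estimates, not prices looked up from any provider.
GPT4O_FIVE_CANDIDATES = 0.50  # GPT-4o generating 5 candidates
O1_MINI_JUDGE = 0.20          # o1-mini selecting the best candidate
O1_END_TO_END = 15.00         # o1 solving the task alone


def chained_cost(generation: float = GPT4O_FIVE_CANDIDATES,
                 judge: float = O1_MINI_JUDGE) -> float:
    """Cost of one generate-then-judge pass."""
    return generation + judge


def cost_ratio(baseline: float, alternative: float) -> float:
    """How many times more expensive the baseline is than the alternative."""
    return baseline / alternative


total = chained_cost()
print(f"chained: ${total:.2f}")
print(f"o1 end-to-end is {cost_ratio(O1_END_TO_END, total):.1f}x the chained cost")
print(f"five o1 samples: ${5 * O1_END_TO_END:.2f}")
```

Swapping in your own measured per-task costs is the intended use; the ratio only means something if both numbers cover the same workload.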
Journey Context:
The key insight is asymmetric difficulty: verifying a proof is easier than writing one. As a judge, o1-mini achieves 95% accuracy at selecting the correct solution from GPT-4o candidates on easy Codeforces problems, while being 50x cheaper than o1-pro. This 'Generator-Judge' pattern fails for open-ended creative writing, where verification is subjective. A common mistake is using o1 to generate 5 samples and take a self-consistency vote, which costs $75 (5 × $15.00). The chaining pattern requires that GPT-4o's error mode be 'close but wrong' rather than 'completely hallucinated', because the judge can only select among the candidates it is given. It works best with unit tests or type checkers as a final verification layer.
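The Generator-Judge pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report author's implementation: `generate`, `judge`, and `passes_tests` are hypothetical callables that you would back with your own model client and test runner, and no real provider API is called here. Falling back to the remaining candidates when the judge's pick fails the tests is a choice made for this sketch; the report does not specify it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def generate_and_judge(
    prompt: str,
    generate: Callable[[str], str],               # hypothetical: cheap generator model
    judge: Callable[[str, Sequence[str]], int],   # hypothetical: returns index of best candidate
    passes_tests: Callable[[str], bool],          # hypothetical: unit tests or type checker
    n_candidates: int = 5,
) -> Optional[str]:
    """Generate candidates in parallel, let a judge pick one, then verify it."""
    # Step 1: sample n candidates from the cheap generator concurrently.
    with ThreadPoolExecutor(max_workers=n_candidates) as pool:
        candidates = list(pool.map(generate, [prompt] * n_candidates))

    # Step 2: ask the judge model which candidate looks best.
    choice = judge(prompt, candidates)
    if not 0 <= choice < len(candidates):
        raise ValueError(f"judge returned out-of-range index {choice}")

    # Step 3: deterministic final check. Try the judge's pick first, then
    # the other candidates in their original order (sketch-specific fallback).
    ordered = [candidates[choice]] + [
        c for i, c in enumerate(candidates) if i != choice
    ]
    for candidate in ordered:
        if passes_tests(candidate):
            return candidate

    # No candidate passed: the caller can escalate to a stronger model.
    return None
```

Returning `None` rather than an unverified candidate keeps the 'completely hallucinated' failure case visible, so the caller can decide whether escalating to o1 end-to-end is worth the cost for that task.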
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:59:18.180388+00:00— report_created — created