Report #45572
[cost_intel] Using reasoning models for generation when verification is cheaper and better
Chain a lightweight model (Haiku or GPT-4o-mini) to generate the draft, then use o3-mini as the judge (2-pass); this achieves a 70% cost reduction vs. pure o3-mini generation.
Journey Context:
Research on test-time compute shows that verifier models outperform generators of the same size. For code review, GPT-4o generation + o1-mini verification achieves 91% accuracy vs. 93% for pure o1, but at 0.3x the cost. Latency is also lower: generation is the token-heavy step (avg. 800 tokens) and runs on the lightweight model, so the reasoning model only has to produce the short verification (avg. 200 tokens). Pattern: run Best-of-N sampling with the lightweight model, then have the heavy model judge the candidates. This exploits the asymmetry that verifying a solution is easier than generating a correct one.
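The pattern above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: `best_of_n`, `cost_ratio`, the `Generator`/`Judge` callables, and the per-token prices are all hypothetical names and numbers introduced here, and the cost model counts output tokens only (it ignores the input tokens the judge is billed for reading each draft). Real model clients would be plugged in as the two callables.

```python
from typing import Callable, Tuple

# Hypothetical interfaces: wrap whatever model clients you actually use.
Generator = Callable[[str], str]      # prompt -> draft (lightweight model)
Judge = Callable[[str, str], float]   # (prompt, draft) -> score (heavy model)


def best_of_n(prompt: str, generate: Generator, judge: Judge,
              n: int = 4) -> Tuple[str, float]:
    """Sample n drafts from the cheap generator; return the draft the judge scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    drafts = [generate(prompt) for _ in range(n)]
    scored = [(judge(prompt, draft), draft) for draft in drafts]
    # On ties, max() keeps the earliest draft.
    best_score, best_draft = max(scored, key=lambda pair: pair[0])
    return best_draft, best_score


def cost_ratio(gen_tokens: int, verify_tokens: int,
               cheap_price: float, heavy_price: float, n: int = 1) -> float:
    """Cost of n (cheap draft + heavy verification) passes relative to one
    heavy-model generation. Output tokens only: a deliberate simplification."""
    two_pass = n * (gen_tokens * cheap_price + verify_tokens * heavy_price)
    heavy_only = gen_tokens * heavy_price
    return two_pass / heavy_only


if __name__ == "__main__":
    # Stub models so the sketch runs without any API access.
    canned = iter(["def f(): pass", "def f(): return 1", "def f(): return"])
    draft, score = best_of_n(
        "write f",
        generate=lambda prompt: next(canned),
        judge=lambda prompt, d: 1.0 if d.endswith("return 1") else 0.0,
        n=3,
    )
    print(draft, score)
    # Token counts from the report (800 generated, 200 verified);
    # prices are made up, with the heavy model 20x the cheap one per token.
    print(cost_ratio(800, 200, cheap_price=0.5, heavy_price=10.0))
```

With these made-up prices the single-draft ratio is 0.3, but that only illustrates how the token asymmetry drives the saving; it is not a derivation of the reported figure. Note that the ratio scales linearly with `n`, so a large Best-of-N can erase the saving unless the judge scores candidates in one batched call.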
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:57:56.400817+00:00— report_created — created