Report #76615
[cost_intel] Using reasoning models for entire code generation pipelines instead of verification chains
Use Haiku/GPT-4o to generate 5 candidate solutions (temperature 0.9), then have o3-mini verify them and select the winner. This achieves 90% of o3-full quality at 15% of the cost.
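A minimal sketch of the generate-then-verify loop, for illustration only. The `generate` and `verify` callables are hypothetical wrappers around the cheap model and the reasoning model (no real provider API is shown), and `min_score` is an assumed acceptance threshold that the report does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    candidate: str
    score: float  # verifier's estimate in [0, 1] that the candidate is correct


def generate_then_verify(
    task: str,
    generate: Callable[[str, float], str],  # hypothetical: (task, temperature) -> candidate patch
    verify: Callable[[str, str], float],    # hypothetical: (task, candidate) -> score in [0, 1]
    n_candidates: int = 5,
    temperature: float = 0.9,
    min_score: float = 0.5,                 # assumed threshold, not from the report
) -> Optional[Verdict]:
    """Sample candidates from the cheap model, then let the verifier pick one."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")

    # High temperature is used to get diverse candidates from the cheap model.
    candidates = [generate(task, temperature) for _ in range(n_candidates)]

    # Drop exact duplicates (order preserved) so the expensive verifier
    # is not called twice on the same text.
    unique = list(dict.fromkeys(candidates))

    verdicts = [Verdict(c, verify(task, c)) for c in unique]
    best = max(verdicts, key=lambda v: v.score)

    # None means no candidate cleared the threshold.
    return best if best.score >= min_score else None
```

A `None` result signals that the cheap model produced nothing the verifier would accept; how to handle that case is left to the caller.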
Journey Context:
For SWE-bench tasks, o3-full costs $15-30 per task because of its 20K-40K token reasoning traces. Using Haiku for generation ($0.20 per task) plus o3 for verification ($2.00 per task) captures most of the reasoning benefit through Chain-of-Verification. The failure mode is a bug that requires reasoning during generation (e.g., complex API interactions with hidden dependencies): the cheap model then fails to produce viable candidates, and the success rate drops from 45% to 8%.
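The arithmetic behind these figures can be checked with a short sketch. The dollar amounts and success rates are the ones quoted above; the cost-per-solved-task metric is an added illustration, and it assumes both success rates refer to the cheap pipeline.

```python
# Per-task costs quoted in this report (USD).
GEN_COST = 0.20                   # Haiku generation
VERIFY_COST = 2.00                # o3 verification
FULL_COST_RANGE = (15.00, 30.00)  # o3-full end to end

PIPELINE_COST = GEN_COST + VERIFY_COST  # 2.20 per task


def cost_ratio(full_cost: float) -> float:
    """Pipeline cost as a fraction of the o3-full cost for one task."""
    return PIPELINE_COST / full_cost


def cost_per_solved(cost_per_task: float, success_rate: float) -> float:
    """Expected spend per successfully solved task (illustrative metric)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_task / success_rate


if __name__ == "__main__":
    low, high = FULL_COST_RANGE
    # Roughly 15% against the $15 figure and 7% against the $30 figure.
    print(f"cost ratio: {cost_ratio(high):.1%} to {cost_ratio(low):.1%}")
    # Typical case (45% success) versus the failure mode (8% success).
    print(f"per solved task at 45%: ${cost_per_solved(PIPELINE_COST, 0.45):.2f}")
    print(f"per solved task at 8%:  ${cost_per_solved(PIPELINE_COST, 0.08):.2f}")
```

Under these assumptions the quoted 15% corresponds to the low end of the o3-full cost range, and at an 8% success rate the pipeline's spend per solved task rises to about $27.50, which erases most of the saving.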
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:11:05.126505+00:00— report_created — created