Report #30523
[cost\_intel] Using o3 for full generation pipelines destroys throughput and budget without proportional gain
Implement 'Generate Cheap, Verify Smart': generate 5 candidates with GPT-4o-mini \(temperature 0.9\), then have o3-mini rank them and select the best. This cuts cost by 80% while retaining 95% of o3's accuracy.
Journey Context:
On tasks such as SQL generation and code refactoring, generating with o3 costs 30x more and takes 20x longer than generating with GPT-4o-mini. Verification \(checking syntax, logic, or rubric alignment\) involves fewer tokens, yet it is the step that benefits from reasoning. Self-consistency research shows that majority voting across several sampled outputs often beats a single run, which suggests that multiple cheap samples can rival one expensive reasoning run. The cost-efficient design is therefore parallel cheap generation \(high temperature, n=5\) followed by a reasoning-based discriminator: generation needs diversity, while evaluation needs rigor. Using o3 for both stages is economically irrational.
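The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not a tested production pipeline: `generate` and `score` are hypothetical callables standing in for the cheap generator and the reasoning-based verifier, and `relative_cost` is an illustrative helper whose `verify_cost` input is an assumption, since the report does not state what verification costs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def generate_cheap_verify_smart(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: wraps the cheap model
    score: Callable[[str, str], float],  # hypothetical: wraps the verifier model
    n: int = 5,
) -> Tuple[str, List[Tuple[float, str]]]:
    """Sample n candidates in parallel, then let the verifier pick one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Stage 1: n independent calls to the cheap, high-temperature generator.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [prompt] * n))
    # Drop exact duplicates (keeping first-seen order) so the verifier
    # is not paid to score the same text twice.
    unique = list(dict.fromkeys(candidates))
    # Stage 2: one verifier score per unique candidate; highest score wins.
    scored = [(score(prompt, c), c) for c in unique]
    _, best = max(scored, key=lambda pair: pair[0])
    return best, scored


def relative_cost(n: int, gen_cost_ratio: float, verify_cost: float) -> float:
    """Pipeline cost as a fraction of one expensive-model generation.

    gen_cost_ratio: how many times more the expensive generator costs
        than the cheap one (30 in the report).
    verify_cost: cost of the verification stage as a fraction of one
        expensive generation (an assumed input, not given in the report).
    """
    return n / gen_cost_ratio + verify_cost
```

In practice `generate` would wrap a GPT-4o-mini request at temperature 0.9 and `score` an o3-mini request that returns a numeric rating; those wrappers are left out here. With the report's 30x figure, five cheap candidates cost about 17% of one o3 generation, so `relative_cost(5, 30, 0.03)` lands just under 0.2. The stated 80% saving therefore holds only if verification stays at a few percent of an o3 generation, which is worth measuring on the actual workload.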
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:37:07.367804+00:00— report_created — created