Report #91256
[cost_intel] Using an expensive reasoning model symmetrically for both generation and verification
Use a cheap model (GPT-4o-mini) to generate candidates and reserve o3-mini for critique/verification only; this reportedly cuts cost 5x while improving accuracy.
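A minimal sketch of the asymmetric setup, assuming the generator and verifier are supplied as plain callables. The names `generate_then_verify`, `generate`, and `score` are hypothetical and wrap whatever client calls you actually use; no real API is invoked here. Dropping duplicate candidates before scoring is an added design choice, not something the report specifies.

```python
from typing import Callable, List, Tuple


def generate_then_verify(
    problem: str,
    generate: Callable[[str], str],      # hypothetical wrapper around the cheap model
    score: Callable[[str, str], float],  # hypothetical wrapper around the verifier model
    n_candidates: int = 3,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Sample candidates cheaply, score each with the verifier, return the best one."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")

    candidates = [generate(problem) for _ in range(n_candidates)]

    # Assumption: identical candidates are dropped (order preserved) so the
    # more expensive verifier is not paid twice for the same text.
    unique = list(dict.fromkeys(candidates))

    scored = [(candidate, score(problem, candidate)) for candidate in unique]
    best = max(scored, key=lambda pair: pair[1])[0]
    return best, scored


# Demo with stub callables standing in for the two models.
_demo_answers = iter(["x = 4", "x = 5", "x = 4"])
demo_best, demo_scored = generate_then_verify(
    "Solve 2x = 10",
    generate=lambda problem: next(_demo_answers),
    score=lambda problem, candidate: 1.0 if candidate == "x = 5" else 0.0,
)
print(demo_best)  # x = 5
```

The verifier is called once per distinct candidate, so its cost scales with the number of unique outputs rather than with `n_candidates`.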
Journey Context:
On math problems, generating 3 solutions with GPT-4o ($0.01) and selecting among them with o3-mini ($0.05) yields higher accuracy than a single o3-mini generation ($0.15). For reasoning models, critique is cheaper than generation because output tokens dominate cost and a critique is shorter than a full solution. The main quality risk with cheap generators is 'surface-level' diversity: candidates that differ in wording but lack semantic variation. The verifier catches this.
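The arithmetic behind these figures can be checked with a short sketch. It uses only the three dollar amounts quoted above, held as integer cents to avoid float rounding. The report does not say whether the $0.01 covers all three candidates or each one, so both readings are computed; `pipeline_cost_cents` and its parameters are hypothetical names.

```python
def pipeline_cost_cents(
    gen_cents: int,
    n_candidates: int,
    verify_cents: int,
    gen_is_per_candidate: bool,
) -> int:
    """Total cost of generating candidates plus one verification pass, in cents."""
    generation = gen_cents * n_candidates if gen_is_per_candidate else gen_cents
    return generation + verify_cents


BASELINE_CENTS = 15  # one o3-mini generation, as quoted in the report

# Reading 1 (assumption): $0.01 covers all three candidates.
batch_total = pipeline_cost_cents(1, 3, 5, gen_is_per_candidate=False)
# Reading 2 (assumption): $0.01 is the price of each candidate.
per_candidate_total = pipeline_cost_cents(1, 3, 5, gen_is_per_candidate=True)

print(batch_total, f"{BASELINE_CENTS / batch_total:.1f}x")                  # 6 2.5x
print(per_candidate_total, f"{BASELINE_CENTS / per_candidate_total:.1f}x")  # 8 1.9x
```

With these numbers the saving is roughly 2x to 2.5x. The 5x figure in the summary presumably rests on different prices (for example GPT-4o-mini rather than GPT-4o), which the report does not list.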
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:46:04.109118+00:00— report_created — created