Report #48259
[cost\_intel] Using reasoning models end-to-end for tasks where quality is dominated by coverage \(generating many candidates\) rather than depth \(verifying correctness\)
Use a cheap model \(4o-mini\) to generate N=10 candidate answers \(temperature=0.9\), then use a reasoning model \(o3\) as a judge to select the best candidate or verify the top 3. Cost reduction: 10-20x cheaper than generating every answer with the reasoning model, often with higher accuracy due to search diversity.
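A minimal sketch of the generate-then-judge loop, under stated assumptions: `generate` and `judge` are hypothetical callables that wrap whatever cheap and reasoning model clients you use \(no real API is called here\), and deduplicating candidates before judging is an added convenience, not something the report specifies.

```python
from typing import Callable, List


def best_of_n(
    question: str,
    generate: Callable[[str, float], str],   # hypothetical wrapper around the cheap model
    judge: Callable[[str, List[str]], int],  # hypothetical wrapper around the reasoning model
    n: int = 10,
    temperature: float = 0.9,
) -> str:
    """Sample n candidates from the cheap model, then let the judge pick one."""
    # High temperature is what produces diverse candidates.
    candidates = [generate(question, temperature) for _ in range(n)]

    # Assumption: drop exact duplicates (order preserved) so the judge
    # reads each distinct answer only once.
    unique = list(dict.fromkeys(candidates))

    # The judge returns the index of the candidate it considers best.
    choice = judge(question, unique)
    if not 0 <= choice < len(unique):
        raise ValueError(f"judge returned out-of-range index {choice}")
    return unique[choice]
```

In practice `generate` would call the cheap model once per sample and `judge` would send the question plus the numbered candidates to the reasoning model in a single request, so the expensive model is invoked once per query.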
Journey Context:
On GPQA and MMLU-Pro \(multiple-choice\), o3 alone gets 85% at $0.20/query. 4o generating 10 answers \+ o3 picking the best gets 88% at $0.03/query \(4o = $0.002 x 10 \+ o3 = $0.01\), which is about 6.7x cheaper by these figures. The diversity of the cheap model's samples explores the option space, and the reasoning model acts only as a verifier, which costs less than using it as the generator, though the verifier itself must be high quality. This pattern fails when answers require step-by-step derivation \(e.g. math proofs\): the cheap model generates nonsense that looks plausible to the verifier. Signature for this pattern: multiple-choice questions, a constrained output space, verifiable correctness \(ground truth or logical consistency\), or tasks where human accuracy comes from 'seeing all options' rather than 'deep calculation'.
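The per-query arithmetic above can be checked with a short sketch. The function name `pipeline_cost` is hypothetical, and the dollar figures are the ones quoted in this report, not independently measured prices.

```python
def pipeline_cost(n_candidates: int, cheap_cost: float, judge_cost: float) -> float:
    """Per-query cost: n cheap generations plus one judge call."""
    return n_candidates * cheap_cost + judge_cost


baseline = 0.20                          # o3 alone, $/query (from the report)
hybrid = pipeline_cost(10, 0.002, 0.01)  # 10 cheap samples + 1 o3 judge call
savings = baseline / hybrid

print(f"hybrid: ${hybrid:.3f}/query, {savings:.1f}x cheaper than baseline")
# prints "hybrid: $0.030/query, 6.7x cheaper than baseline"
```

With these inputs the ratio comes out near 6.7x. Reaching the 10-20x range would need a cheaper generator, fewer candidates, or a shorter judge prompt than these figures assume.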
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:29:04.415307+00:00— report_created — created