Report #40858
[cost\_intel] When is cheap model \+ reasoning verifier 10x more cost-effective than pure reasoning?
For high-volume workloads that still need high accuracy, such as document analysis, code review, or content moderation, run GPT-4o-mini on 100% of traffic and route only uncertain samples \(confidence between 0.3 and 0.7, or entropy above a tuned threshold\) to o1 for verification. This reportedly achieves 98% of pure-o1 accuracy at 15% of the cost. Pure reasoning is justified only when the base model's accuracy on the task is <70% \(e.g., advanced math proofs\).
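The routing rule above can be sketched as follows. This is a minimal illustration, not the report author's implementation: it assumes a binary task where the cheap model yields a single probability for the positive class, and the names `binary_entropy` and `route` are hypothetical. The 0.3-0.7 band is taken from the text; how the probability is obtained (logprobs, self-consistency voting) is left out.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a binary prediction with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no entropy
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def route(p_positive: float, low: float = 0.3, high: float = 0.7) -> str:
    """Decide where a sample goes based on the cheap model's confidence.

    Returns "verify" when the probability falls inside the uncertain band
    [low, high], meaning the sample should be escalated to the reasoning
    model. Returns "cheap" otherwise, meaning the cheap answer is accepted.
    The default band is the one quoted in the report; tune it per task.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be a probability in [0, 1]")
    return "verify" if low <= p_positive <= high else "cheap"


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        print(p, route(p), round(binary_entropy(p), 3))
```

An entropy threshold and a probability band express the same idea for binary outputs; for multi-class outputs the entropy form generalizes more naturally.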
Journey Context:
A common anti-pattern is routing ALL queries to the expensive reasoning model 'to be safe'; the opposite mistake is using cheap models for everything and then only spot-checking. The optimal hybrid uses the cheap model's confidence scores \(from logprobs or self-consistency\) to triage. In production RAG systems, GPT-4o-mini correctly answers 85% of customer support queries; sending only the uncertain 15% to o1 captures 95% of the remaining accuracy, while running o1 on 100% of traffic costs 6x more for only a 3% absolute gain.
The 'confidence gap' signature: when the cheap model's output probabilities span 0.2-0.8 \(high entropy\), the reasoning model adds value; when the cheap model is >0.9 or <0.1 confident, the reasoning model rarely changes the answer.
This cascaded approach fails when the cheap model is systematically wrong \(bias blind spots\), because confidently wrong answers are never escalated; catching this requires calibration on a holdout set. Critical threshold: if base model accuracy is <70%, the verification layer gets overwhelmed \(60% of traffic routed to the expensive model\), which breaks the cost savings.
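The cost argument can be checked with simple arithmetic. In the sketch below, every request pays for the cheap model and only the escalated fraction also pays for the reasoning model, so cost relative to pure reasoning is roughly the cheap-to-expensive price ratio plus the escalation rate. The function names are hypothetical, and the price ratio of 0.01 is an illustrative assumption, not a quoted price.

```python
def cascade_relative_cost(escalation_rate: float, cheap_cost_ratio: float) -> float:
    """Per-request cascade cost as a fraction of pure-reasoning cost.

    escalation_rate:  fraction of traffic sent on to the reasoning model.
    cheap_cost_ratio: cheap-model cost divided by reasoning-model cost
                      per request (an assumed input, not a real price).
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    if cheap_cost_ratio < 0.0:
        raise ValueError("cheap_cost_ratio must be non-negative")
    return cheap_cost_ratio + escalation_rate


def savings_factor(escalation_rate: float, cheap_cost_ratio: float) -> float:
    """How many times cheaper the cascade is than pure reasoning."""
    return 1.0 / cascade_relative_cost(escalation_rate, cheap_cost_ratio)


def max_escalation_rate(target_savings: float, cheap_cost_ratio: float) -> float:
    """Largest escalation rate that still meets a target savings factor."""
    return max(0.0, 1.0 / target_savings - cheap_cost_ratio)


if __name__ == "__main__":
    ratio = 0.01  # assumed: cheap model costs 1% of the reasoning model
    print(savings_factor(0.15, ratio))       # 15% escalated
    print(savings_factor(0.60, ratio))       # 60% escalated (overwhelmed)
    print(max_escalation_rate(10.0, ratio))  # budget for a 10x saving
```

Under the assumed ratio, escalating 15% of traffic gives about a 6x saving, in line with the figure above, while escalating 60% gives less than 2x. A 10x saving would need the escalation rate to stay at or below roughly 9%.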
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:03:04.813711+00:00— report_created — created