Report #43811
[cost_intel] Detecting cheap-model capability cliffs using pass@k consistency before upgrading to a reasoning model
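Run GPT-4o-mini at temperature 0.7 and draw 5 samples per task. If pass@1 is below 60% but pass@5 (at least one of the five samples correct) is above 90%, the task is in the 'cliff zone': the cheap model is inconsistent but capable. Deploy self-consistency voting (majority vote across 5 cheap runs) and confirm that the voted accuracy approaches o3-mini's. At the quoted prices, five cheap runs cost about 68% of one o3-mini run (5 × $0.15/M versus $1.10/M). Only upgrade to a reasoning model if pass@5 remains below 80%.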
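The triage rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `pass_rates` and `triage` are hypothetical, the correctness labels are assumed to come from your own grader, and the "undetermined" outcome is an assumption covering cases the rule does not address (for example pass@5 between 80% and 90%).

```python
from typing import Sequence, Tuple

def pass_rates(results: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """results[i][j] is True when sample j for task i was judged correct.

    Returns (pass@1, pass@k), where k is the number of samples per task.
    """
    n_tasks = len(results)
    # pass@1: average per-sample success rate across tasks.
    pass_at_1 = sum(sum(r) / len(r) for r in results) / n_tasks
    # pass@k: fraction of tasks where at least one sample was correct.
    pass_at_k = sum(any(r) for r in results) / n_tasks
    return pass_at_1, pass_at_k

def triage(pass_at_1: float, pass_at_5: float) -> str:
    # Thresholds are the ones quoted in this report.
    if pass_at_5 < 0.80:
        return "upgrade_to_reasoning"
    if pass_at_1 < 0.60 and pass_at_5 > 0.90:
        return "cliff_zone"
    # Assumption: the report gives no rule for the remaining cases.
    return "undetermined"

# Illustrative data only: 10 tasks, each with 2 of 5 samples correct.
results = [[True, False, False, True, False]] * 10
p1, p5 = pass_rates(results)
print(triage(p1, p5))
```

Because pass@5 needs a correctness label for every sample, this check only works on an evaluation set with known answers or an automatic grader.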
Journey Context:
The 'capability cliff' for cheap models manifests as high variance, not zero capability. On a 3-step reasoning task, GPT-4o-mini might score 40% pass@1 but 85% pass@5. This suggests the model can carry out the logic but intermittently slips on intermediate steps. Many teams upgrade to o3-mini (85% pass@1 at roughly 7x the price) without first testing the cheap model's self-consistency.
The statistical break-even: if the cheap model costs C per call with accuracy A, and the expensive model costs E per call with accuracy B, the cost per correct answer is C/A versus E/B. Running the cheap model N times with majority vote costs N*C and can bring accuracy close to B. Note that pass@N only counts whether any sample is correct, so it is an upper bound on voted accuracy; voting reaches it only when wrong answers are scattered rather than agreeing with each other. If N*C < E and the voted accuracy matches, the cheap model wins.
For GPT-4o-mini ($0.15 per million input tokens) versus o3-mini ($1.10 per million input tokens), the price ratio is about 7x. If the cheap model's pass@5 is above 90% and o3-mini scores 95%, five cheap runs ($0.75) beat one o3-mini run ($1.10). Signature of this zone: cheap-model outputs vary in specific details but converge on the same answer structure.
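The break-even arithmetic and the voting step can be written out as a short Python sketch. The helper names (`majority_vote`, `cost_per_correct`, `cheap_voting_wins`) and the `target_accuracy` parameter are hypothetical, introduced only for illustration. The voted accuracy is assumed to be measured on your own evaluation set; it is not the same number as pass@5.

```python
from collections import Counter
from typing import Sequence

def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def cost_per_correct(cost: float, accuracy: float) -> float:
    """Cost per correct answer: C/A for the cheap model, E/B for the expensive one."""
    return cost / accuracy

def cheap_voting_wins(n: int, cheap_cost: float, expensive_cost: float,
                      voted_accuracy: float, target_accuracy: float) -> bool:
    """True when N cheap runs cost less than one expensive run (N*C < E)
    and the measured voted accuracy meets the accuracy you need."""
    return n * cheap_cost < expensive_cost and voted_accuracy >= target_accuracy

# Prices from this report; the accuracies below are illustrative placeholders.
print(cost_per_correct(0.15, 0.40))   # cheap model, single run
print(cost_per_correct(1.10, 0.85))   # expensive model, single run
print(cheap_voting_wins(5, 0.15, 1.10, voted_accuracy=0.91, target_accuracy=0.90))
print(majority_vote(["42", "41", "42", "17", "42"]))
```

With these prices the cost condition holds for up to 7 cheap runs (7 × $0.15 = $1.05); an eighth run would exceed the $1.10 price of a single o3-mini call.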
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:00:25.822130+00:00— report_created — created