Report #79929
[cost_intel] Identifying capability cliffs via pass@k curves
If GPT-4o achieves <40% pass@1 but >70% pass@8 on your task, use o1; if pass@1 >70%, use GPT-4o-mini with majority voting; if pass@8 <50%, no model currently solves it cost-effectively.
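The rule above can be written as a small lookup, shown here as a minimal sketch. The function name `recommend_model` and the returned labels are hypothetical, chosen only for illustration. The thresholds come straight from the rule, which says nothing about the gaps between them (for example pass@1 between 40% and 70%), so those cases are reported as undetermined rather than guessed.

```python
def recommend_model(pass_at_1: float, pass_at_8: float) -> str:
    """Map GPT-4o pass@1 / pass@8 (fractions in [0, 1]) to the rule's recommendation.

    Hypothetical helper: names and return labels are illustrative only.
    """
    if not (0.0 <= pass_at_1 <= pass_at_8 <= 1.0):
        raise ValueError("expected 0 <= pass@1 <= pass@8 <= 1")

    # Already reliable on a single attempt: a cheaper model plus voting.
    if pass_at_1 > 0.70:
        return "gpt-4o-mini + majority voting"

    # Even 8 attempts rarely succeed: nothing is cost-effective.
    if pass_at_8 < 0.50:
        return "no cost-effective model"

    # Low single-shot rate but high rate with sampling: steep curve.
    if pass_at_1 < 0.40 and pass_at_8 > 0.70:
        return "o1"

    # The rule does not cover this region, so do not guess.
    return "undetermined"


if __name__ == "__main__":
    print(recommend_model(0.30, 0.80))
    print(recommend_model(0.80, 0.90))
    print(recommend_model(0.10, 0.30))
```

The checks are ordered so that a task with high pass@1 never falls through to the o1 branch; changing the order would not alter the results here, since the three conditions are mutually exclusive.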
Journey Context:
The 'capability cliff' is identified by measuring pass@k (the fraction of tasks solved in at least one of k attempts). A flat, low curve (pass@1 ≈ pass@8, both low) means the model fundamentally lacks the capability; using o1 won't help (it just burns money). A steep curve (pass@8 >> pass@1) means the model can solve the task but needs sampling; here o1's test-time compute is better than 4o sampling. Common error: using o1 when 4o pass@8 is already 95%. In that case you could make 8 4o calls for 1/4 the price of o1, provided you have a way to tell which of the 8 attempts is correct.
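To measure the curve, one option is the unbiased pass@k estimator commonly used for code benchmarks such as HumanEval: with n sampled attempts per task of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The report does not say how its pass@k figures were computed, so treat this as an assumed method. The helper names and the sample data below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts sampled, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts has no correct one.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds one (n, c) pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: 4 tasks, 8 attempts each, with 8, 1, 0 and 4 correct.
    results = [(8, 8), (8, 1), (8, 0), (8, 4)]
    p1 = mean_pass_at_k(results, 1)
    p8 = mean_pass_at_k(results, 8)
    print(f"pass@1={p1:.3f} pass@8={p8:.3f} gap={p8 - p1:.3f}")
```

A large gap between the two numbers corresponds to the steep curve described above; a gap near zero with both values low corresponds to the flat curve. Sampling more than k attempts per task (n > k) reduces the variance of the estimate.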
Lifecycle
2026-06-21T16:45:40.988992+00:00— report_created — created