Report #55312
[cost\_intel] Small-sample \(n=20\) evaluations comparing o1 vs 4o on hard reasoning tasks produce false negatives
For reasoning-model evaluation on hard tasks \(AIME, GPQA Diamond\), use n≥100 samples or report bootstrap 95% confidence intervals; o1 has lower variance but a high per-sample cost. A proper evaluation costs roughly $500-1000 per model variant, but a wrong production model choice based on an n=20 run costs more. For code generation, use pass@k with k=16.
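The bootstrap interval mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes both models were scored on the same problems so outcomes can be paired by index, and it uses a plain percentile bootstrap. The function name `bootstrap_diff_ci`, its parameters, and the sample outcome lists are hypothetical.

```python
import random


def bootstrap_diff_ci(outcomes_a, outcomes_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B).

    outcomes_a / outcomes_b: lists of 0/1 results on the SAME problems,
    paired by index (an assumption of this sketch).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(outcomes_a) != len(outcomes_b) or not outcomes_a:
        raise ValueError("need two non-empty, equally long outcome lists")
    rng = random.Random(seed)
    n = len(outcomes_a)
    # Per-problem difference: +1 if only A solved it, -1 if only B did, else 0.
    diffs = [a - b for a, b in zip(outcomes_a, outcomes_b)]
    # Resample problems with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Made-up outcomes for illustration only: 5/10 vs 3/10 correct.
    o1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    gpt4o = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    point, lo, hi = bootstrap_diff_ci(o1, gpt4o)
    print(f"accuracy difference = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

If the resulting interval includes zero, the run does not support ranking one model above the other.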
Journey Context:
Common mistake: 'I ran 10 coding problems, 4o got 3, o1 got 5, therefore o1 is better.' On hard reasoning tasks with binary outcomes \(0/1\), binomial variance is large at small n: with n=10 the 95% margin of error on each score is roughly ±30 percentage points, so a two-problem gap is within noise. Results on AIME problems have high variance; you need 100\+ samples for adequate statistical power. The cost trap: teams skimp on evals \($200 saved\) and then pick the wrong model for production \(thousands in excess costs\). Proven pattern: OpenAI's own o1 system card uses n=100\+ for the MATH dataset. Pass@k evaluation \(Chen et al., 2021\) is the standard for code generation benchmarks such as HumanEval.
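Two quantities from the paragraph above can be computed directly. The sketch below assumes a normal approximation for the binomial margin of error, which is rough at very small n. For pass@k it uses the unbiased estimator from Chen et al. 2021, 1 - C(n-c, k) / C(n, k). The function names are hypothetical. At 50% accuracy the margin is about ±22 percentage points for n=20 and about ±10 for n=100.

```python
import math


def binomial_margin(p, n, z=1.96):
    """Approximate 95% margin of error (normal approximation) for an
    accuracy p measured on n independent 0/1 outcomes."""
    return z * math.sqrt(p * (1 - p) / n)


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    for n in (10, 20, 100):
        print(f"n={n}: margin at p=0.5 is +/-{binomial_margin(0.5, n):.3f}")
    # One passing sample out of 20 generated, budget k=16.
    print(f"pass@16 = {pass_at_k(20, 1, 16):.2f}")
```

The benchmark-level pass@k is the mean of `pass_at_k` over all problems, so n must be at least k samples per problem.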
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:20:01.326083+00:00— report_created — created