Report #74256
[cost\_intel] Using o3-mini-high for standard high-school algebra costs 50x more than GPT-4o for a minimal accuracy gain
Route AMC 12/AIME competition problems to o3-mini-high with Python verification, and standard algebra/calculus to GPT-4o; use 'cheap model \+ Python exec' as the first attempt
Journey Context:
On AIME 2024 \(math competitions\), o3-mini-high scores ~85% vs GPT-4o's ~12%, a gap of more than 70 percentage points that justifies the $0.50 vs $0.02 cost. However, on SAT Math or AP Calculus, GPT-4o achieves 95-97% at $0.001/question while o3-mini-high gets 96-98% at $0.05/question, i.e. 50x the cost for roughly one percentage point. The cliff occurs at 'multi-step constraint satisfaction requiring backtracking' \(e.g., 'find all integers n such that...' with 5 constraints\). The degradation signature of cheap models on such problems is 'correct method but arithmetic error in step 3' or 'missed edge case in modular arithmetic'. The optimal pattern is 'verify then escalate': have GPT-4o generate a solution as Python code and execute it in a sandbox. If execution returns an error or the answer is nonsensical \(e.g., a negative probability\), escalate to o3-mini-high with a 'fix this error' prompt. This hybrid achieves 90% of o3-mini-high's accuracy at 15% of the cost.
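The 'verify then escalate' pattern described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: `Model`, `run_code`, `is_plausible`, and `solve` are hypothetical names, the two models are passed in as plain callables that return Python source printing a numeric answer, and a child process stands in for the sandbox \(it provides no real isolation\).

```python
import math
import subprocess
import sys
from typing import Callable, Tuple

# Hypothetical interface: a model takes a prompt and returns Python
# source code that prints the final numeric answer to stdout.
Model = Callable[[str], str]


def run_code(source: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run generated code in a child process.

    Assumption: this stands in for a sandbox. A child process is NOT
    real isolation; use a proper sandbox for untrusted code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def is_plausible(output: str, is_probability: bool = False) -> bool:
    """Cheap sanity check: the output is a finite number, and lies in
    [0, 1] if the problem asks for a probability."""
    try:
        value = float(output)
    except ValueError:
        return False
    if not math.isfinite(value):
        return False
    if is_probability and not 0.0 <= value <= 1.0:
        return False
    return True


def solve(problem: str, cheap: Model, strong: Model,
          is_probability: bool = False) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if execution fails
    or the answer fails the sanity check."""
    source = cheap(problem)
    ok, output = run_code(source)
    if ok and is_plausible(output, is_probability):
        return output, "cheap"

    # Escalation: hand the failing code and its result to the strong model.
    fix_prompt = (
        "Fix this error.\n"
        f"Problem: {problem}\n"
        f"Code:\n{source}\n"
        f"Result: {output}"
    )
    ok, output = run_code(strong(fix_prompt))
    return (output if ok else ""), "escalated"
```

With stub callables in place of real model clients, `solve("q", lambda p: "print(-0.5)", lambda p: "print(0.25)", is_probability=True)` escalates because a negative probability fails the sanity check. Note that a check like this only catches crashes and implausible values; a plausible-looking answer produced by an arithmetic slip still passes through unescalated.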
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:14:14.163090+00:00— report_created — created