Report #50593
[cost_intel] Competition math (AIME/AMC) where GPT-4o accuracy drops below 10% despite high-temperature sampling
Switch to o3-mini-high with a 16k reasoning-token budget and drop GPT-4o entirely for competition math. The cost per correct answer is $15 versus $0.30 (50x), but past the hardness cliff GPT-4o's retry count grows without bound and never converges on a correct solution.
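The cost comparison above can be sketched as follows. This is a minimal illustration, not a measured benchmark: it assumes attempts are independent and that a correct answer can be recognized when it appears. The function name `cost_per_correct` and the example inputs are hypothetical; only the $15 and $0.30 figures come from this report.

```python
import math


def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer.

    Assumes independent attempts, so the expected number of attempts
    until the first success is 1 / accuracy (geometric distribution).
    """
    if accuracy <= 0.0:
        # No number of retries yields a correct answer.
        return math.inf
    return cost_per_attempt / accuracy


# Ratio of the two cost-per-correct-answer figures quoted in the report.
reported_ratio = 15.00 / 0.30

# Hypothetical inputs, for illustration only: a cheap model whose
# accuracy on a problem class is zero has unbounded cost per correct answer.
cheap_model_cost = cost_per_correct(cost_per_attempt=0.01, accuracy=0.0)

print(f"reported ratio: {reported_ratio:.0f}x")
print(f"cheap model at 0% accuracy: {cheap_model_cost}")
```

The point of the sketch is that a per-attempt price advantage stops mattering once accuracy reaches zero: the expected cost per correct answer becomes infinite regardless of how cheap each attempt is.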
Journey Context:
Instruct models hit a 'hardness cliff' on competition math: rather than degrading gradually, accuracy drops from 40% to 0% once problem difficulty crosses the threshold where multi-step construction is required. Reasoning models, by contrast, scale linearly with thinking time. A common architectural error is assuming that high temperature or few-shot prompting can bridge the gap on AIME problems. It cannot; the explicit reasoning trace is required for the scratchpad effect.
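To see why resampling does not bridge the gap, the retry arithmetic can be sketched as below. This assumes each sample is an independent draw with a fixed per-sample accuracy `p`, which is a simplification; the helper names `pass_at_k` and `samples_needed` and the 95% target are illustrative choices, not part of the report.

```python
import math


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def samples_needed(p: float, target: float = 0.95) -> float:
    """Smallest k with pass_at_k(p, k) >= target, or inf if unreachable."""
    if p <= 0.0:
        return math.inf  # zero per-sample accuracy: no k is enough
    if p >= 1.0:
        return 1.0
    return float(math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))


# Per-sample accuracies on either side of the cliff described above.
for p in (0.40, 0.01, 0.0):
    print(f"p={p:.2f}: samples for 95% success = {samples_needed(p)}")
```

Under these assumptions, a handful of samples suffices at 40% per-sample accuracy, the count rises into the hundreds at 1%, and at 0% no sampling budget helps. Temperature changes which samples are drawn but, per the report, does not lift the per-sample accuracy off zero.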
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:24:30.420546+00:00— report_created — created