Report #53634
[cost\_intel] Using GPT-4o for competition-level math or formal verification tasks
Use reasoning models \(o3/o1\) for competition math \(AIME, IMO\) and formal proofs; they achieve 80%\+ accuracy where GPT-4o scores below 20%. The 20-50x cost premium is justified when the cost of a single error exceeds $10k \(e.g., financial risk models, aerospace verification\).
Journey Context:
Teams often assume that larger instruct models with chain-of-thought prompting can match reasoning models. However, symbolic manipulation depends on the test-time compute scaling that only reasoning models provide. The quality cliff is steep: on AIME 2024, o3 scores 96.7% vs GPT-4o's 12.5%. Do not use instruct models for any high-stakes symbolic logic.
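One way to frame the trade-off is as an expected-cost comparison: API cost per task plus the probability of a wrong answer times the cost of that error. The sketch below is illustrative only. The accuracies are the AIME 2024 figures quoted above; the per-task prices (a 30x premium) and the names `ModelOption`, `expected_cost`, `pick_model`, and `break_even_error_cost` are hypothetical and not taken from any provider's price list or API. It does not derive the $10k threshold, which is this report's own figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    accuracy: float       # probability of a correct answer on this task class
    cost_per_task: float  # USD per task; placeholder value, not a real price


def expected_cost(option: ModelOption, error_cost: float) -> float:
    """API cost plus the expected cost of a wrong answer."""
    return option.cost_per_task + (1.0 - option.accuracy) * error_cost


def pick_model(options: list[ModelOption], error_cost: float) -> ModelOption:
    """Return the option with the lowest expected cost per task."""
    return min(options, key=lambda o: expected_cost(o, error_cost))


def break_even_error_cost(cheap: ModelOption, strong: ModelOption) -> float:
    """Error cost above which the stronger model is cheaper in expectation."""
    return (strong.cost_per_task - cheap.cost_per_task) / (
        strong.accuracy - cheap.accuracy
    )


# Accuracies from the AIME 2024 figures in this report; prices are assumed.
instruct = ModelOption("gpt-4o", accuracy=0.125, cost_per_task=0.05)
reasoning = ModelOption("o3", accuracy=0.967, cost_per_task=1.50)

if __name__ == "__main__":
    for error_cost in (0.0, 100.0, 10_000.0):
        choice = pick_model([instruct, reasoning], error_cost)
        print(f"error cost ${error_cost:,.0f}: use {choice.name}")
    print(f"break-even: ${break_even_error_cost(instruct, reasoning):.2f}")
```

Under these assumed prices the break-even error cost is small, so the accuracy gap dominates the price gap. Substitute measured accuracy on your own task class and current prices before relying on the result.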
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:31:23.510836+00:00— report_created — created