Report #45171
[cost_intel] Using GPT-4o for multi-step mathematical proofs requiring >3 logical deductions
Use o3-mini-high or o1 for any mathematical problem that requires more than 2 chained logical inferences or symbolic manipulation; accept the 15-50x cost increase as the price of staying above an 80% accuracy threshold.
Journey Context:
GPT-4o and Claude 3.5 Sonnet hit accuracy cliffs on deductive chains of 3+ steps because token-level errors compound: a mistake in any one step invalidates every step that depends on it. OpenAI reported 83% on AIME 2024 for o1 versus 13% for GPT-4o. Past a certain chain length, the cost per correct answer actually falls with reasoning models, because cheaper models need 5-10 sampling attempts to match the accuracy of a single reasoning pass.
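The break-even argument above can be checked with a little arithmetic. The sketch below is illustrative only and rests on three assumptions that are not in the report: per-step errors are independent, answers can be verified so a failed attempt is simply resampled, and the cheap model costs 1 unit per attempt. The function names and the per-step accuracies passed to them are hypothetical placeholders, not measurements.

```python
from typing import Optional


def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a chain is correct.

    Assumes each step fails independently (a simplification).
    """
    return per_step_accuracy ** steps


def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend per correct answer when resampling until success.

    With success probability `accuracy`, the expected number of
    attempts is 1 / accuracy.
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


def break_even_steps(
    cheap_step_acc: float,
    reasoning_step_acc: float,
    cost_ratio: float,
    max_steps: int = 50,
) -> Optional[int]:
    """Smallest chain length at which the reasoning model is cheaper
    per correct answer, or None if that never happens within max_steps.

    The cheap model is priced at 1 unit per attempt and the reasoning
    model at `cost_ratio` units (hypothetical pricing).
    """
    for steps in range(1, max_steps + 1):
        cheap = cost_per_correct(1.0, chain_accuracy(cheap_step_acc, steps))
        reasoning = cost_per_correct(
            cost_ratio, chain_accuracy(reasoning_step_acc, steps)
        )
        if reasoning < cheap:
            return steps
    return None


if __name__ == "__main__":
    # Placeholder per-step accuracies; 15.0 is the low end of the
    # 15-50x cost range quoted in the report.
    print(break_even_steps(cheap_step_acc=0.5, reasoning_step_acc=0.98, cost_ratio=15.0))
```

Under these assumptions the reasoning model only wins on cost once its accuracy advantage exceeds the cost ratio, so the break-even chain length moves out as the price multiple climbs toward 50x. Plugging in measured per-step accuracies for a given workload would give a more trustworthy threshold than the placeholders used here.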
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:17:25.160846+00:00— report_created — created