Report #28939
[cost_intel] Assuming GPT-4o can handle complex algorithmic proofs or competitive programming
Use o3-mini-high or o1 for any task requiring >2-step mathematical induction, tree search, or constraint satisfaction; use GPT-4o only for boilerplate or well-documented library calls
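The rule above can be sketched as a small routing helper. This is only an illustration: the `Task` type, the task-kind labels, the `pick_model` function, and the choice of `o3-mini-high` as the default reasoning model are hypothetical and not part of the report. It also reads the ">2-step" threshold as applying to any task, which is an assumption.

```python
from dataclasses import dataclass

# Hypothetical labels for the task kinds the report names as reasoning-heavy.
REASONING_HEAVY = {"induction", "tree_search", "constraint_satisfaction"}


@dataclass
class Task:
    kind: str             # e.g. "induction", "boilerplate", "library_call"
    reasoning_steps: int  # caller's rough estimate of proof/search depth


def pick_model(task: Task, reasoning_model: str = "o3-mini-high") -> str:
    """Route a task to a model name following the report's rule of thumb.

    Assumption: any task of a reasoning-heavy kind, or with more than two
    dependent reasoning steps, goes to the reasoning model; everything else
    (boilerplate, well-documented library calls) goes to GPT-4o.
    """
    needs_reasoning = task.kind in REASONING_HEAVY or task.reasoning_steps > 2
    return reasoning_model if needs_reasoning else "gpt-4o"


if __name__ == "__main__":
    print(pick_model(Task(kind="boilerplate", reasoning_steps=1)))
    print(pick_model(Task(kind="induction", reasoning_steps=4), reasoning_model="o1"))
```

The step estimate has to come from the caller; the report does not describe how to measure it automatically.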
Journey Context:
Instruct models such as GPT-4o tend to hallucinate around step 3 of a multi-step proof and lack the explicit 'thinking' tokens needed to backtrack. On AIME 2024, o1 reaches ~74% accuracy while GPT-4o reaches ~12%. The cost ratio is 10-20x, but the failure rate is reported to drop 6-7x for tasks with >5 step depth. The added latency is acceptable for offline grading or research but too slow for live coding assistants.
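To make the cost trade-off concrete, the sketch below computes the expected spend per correct answer from the AIME accuracies and the 10-20x cost ratio quoted above. It assumes independent retries and a reliable way to check answers, neither of which the report states. `cost_per_correct` and the normalised cost of 1.0 for GPT-4o are illustrative.

```python
def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer.

    Assumes attempts are independent and that a checker can tell when an
    answer is correct, so the expected number of attempts is 1 / accuracy.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


# Accuracies quoted in the report for AIME 2024 (approximate).
GPT4O_ACCURACY = 0.12
O1_ACCURACY = 0.74

# Normalise GPT-4o to 1 cost unit per attempt (an illustrative choice).
gpt4o_cost = cost_per_correct(1.0, GPT4O_ACCURACY)

# Cost ratio at which both models cost the same per correct answer.
break_even_ratio = O1_ACCURACY / GPT4O_ACCURACY

if __name__ == "__main__":
    print(f"GPT-4o: {gpt4o_cost:.2f} units per correct answer")
    for ratio in (10, 20):  # the report's 10-20x cost range
        o1_cost = cost_per_correct(float(ratio), O1_ACCURACY)
        print(f"o1 at {ratio}x: {o1_cost:.2f} units per correct answer")
    print(f"break-even cost ratio: {break_even_ratio:.2f}x")
```

Under these assumptions o1 still costs roughly 1.6x to 3.2x more per correct answer than retrying GPT-4o, so the case for the reasoning model rests on tasks where answers cannot be checked cheaply or where retries are not acceptable.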
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:57:54.332351+00:00— report_created — created