Report #71694
[cost_intel] When does o1's 50x cost premium pay off for formal verification versus GPT-4o?
Reserve o1/o3 for formal-verification work (Lean, Coq, TLA+) on theorems above undergraduate difficulty or proofs requiring more than 5 steps. In OpenAI's reported results, GPT-4o solved 13% of problems on an IMO qualifying exam (AIME), while o1 scored 83%. For undergraduate-level proofs or syntactic translation, 4o with few-shot prompting achieves 70% of o1's accuracy at 1/50th the cost.
Journey Context:
Formal verification requires maintaining state across long inference chains, which is exactly where chain-of-thought reasoning shines. OpenAI's evals show o1 scoring 83% on an IMO qualifying exam (AIME), versus 13% for GPT-4o. However, for 'fill in the lemma' tasks or type-checking existing proofs, the gap narrows to 20% while the cost remains 50x higher. The common mistake is using o1 for 'proof engineering' (boilerplate definitions, simple inductions) where 4o suffices. The signature of correct usage: when the proof requires insight (a clever construction, a non-obvious induction hypothesis), o1 is worth the premium; when it is 'follow the types,' it is waste.
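The routing rule described above can be sketched as a small dispatcher. This is a minimal illustration, not a tested policy: `ProofTask`, `choose_model`, and `relative_cost` are hypothetical names, the task fields are rough manual estimates, and the cost table only encodes the report's 50x ratio rather than real prices.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    UNDERGRADUATE = 1
    ABOVE_UNDERGRADUATE = 2


@dataclass(frozen=True)
class ProofTask:
    name: str
    difficulty: Difficulty
    estimated_steps: int   # rough guess at proof length (assumption: estimated by hand)
    needs_insight: bool    # clever construction or non-obvious induction hypothesis


# Relative per-task cost from the report's 50x figure (not measured prices).
RELATIVE_COST = {"gpt-4o": 1, "o1": 50}


def choose_model(task: ProofTask) -> str:
    """Send a task to o1 only if it meets one of the report's criteria."""
    if task.needs_insight:
        return "o1"
    if task.difficulty is Difficulty.ABOVE_UNDERGRADUATE:
        return "o1"
    if task.estimated_steps > 5:
        return "o1"
    # 'Follow the types' work: boilerplate definitions, simple inductions.
    return "gpt-4o"


def relative_cost(tasks: list[ProofTask]) -> int:
    """Total cost of a workload in units of one GPT-4o task."""
    return sum(RELATIVE_COST[choose_model(t)] for t in tasks)


tasks = [
    ProofTask("boilerplate definitions", Difficulty.UNDERGRADUATE, 2, False),
    ProofTask("simple induction", Difficulty.UNDERGRADUATE, 4, False),
    ProofTask("olympiad construction", Difficulty.ABOVE_UNDERGRADUATE, 12, True),
]

routed = relative_cost(tasks)               # 1 + 1 + 50 = 52
all_o1 = RELATIVE_COST["o1"] * len(tasks)   # 150
print(routed, all_o1)
```

On this toy workload, routing costs 52 units against 150 for sending everything to o1. The saving depends entirely on how much of the real workload is 'proof engineering', so the thresholds should be checked against your own task mix.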
Lifecycle
2026-06-21T02:55:24.968098+00:00— report_created — created