Report #70902
[cost_intel] Legal reasoning (MBE/bar exam) vs entity extraction from contracts
Use o1 for MBE-style legal reasoning (o1 scores 74% vs GPT-4o's 68%), but use GPT-4o for entity extraction (parties, dates), where o1 costs 50x more and takes 60s vs 2s for no accuracy gain.
Journey Context:
Legal tech teams often apply reasoning models to every legal task, but the value is highly stratified. On a simulated Multistate Bar Exam (MBE), o1's improvement over GPT-4o is meaningful (74% vs 68%). For structured entity extraction from contracts (identifying parties, effective dates, termination clauses), however, GPT-4o with few-shot prompting achieves >96% F1; o1 provides no improvement, costs 50x more, and takes 60s vs 2s. The deciding signal is task structure: if the legal task is a closed-form, regex-like extraction, reasoning is wasted; if it is open-ended statutory interpretation, reasoning is essential.
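The routing rule above can be sketched as a small dispatcher. This is a minimal illustration, not a tested production router: the task labels, the `ModelProfile` type, and the function names are hypothetical, and the cost and latency numbers are only the figures quoted in this report (cost expressed relative to GPT-4o).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    relative_cost: float  # cost per request relative to GPT-4o (report: o1 is 50x)
    latency_s: float      # seconds per request as quoted in the report


# Figures taken from the report; treat them as rough, not measured here.
GPT_4O = ModelProfile("gpt-4o", relative_cost=1.0, latency_s=2.0)
O1 = ModelProfile("o1", relative_cost=50.0, latency_s=60.0)

# Hypothetical task labels, chosen for illustration only.
CLOSED_FORM_TASKS = {
    "party_extraction",
    "date_extraction",
    "termination_clause_extraction",
}
OPEN_ENDED_TASKS = {"mbe_question", "statutory_interpretation"}


def choose_model(task: str) -> ModelProfile:
    """Route by task structure: closed-form extraction goes to GPT-4o,
    open-ended legal reasoning goes to o1."""
    if task in CLOSED_FORM_TASKS:
        return GPT_4O
    if task in OPEN_ENDED_TASKS:
        return O1
    raise ValueError(f"unknown task type: {task!r}")


def batch_estimate(tasks: list[str]) -> tuple[float, float]:
    """Return (total relative cost, total sequential latency in seconds)."""
    models = [choose_model(t) for t in tasks]
    return (
        sum(m.relative_cost for m in models),
        sum(m.latency_s for m in models),
    )
```

Raising on an unknown task, rather than silently defaulting to either model, forces each new task type to be classified explicitly before it incurs reasoning-model cost.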
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:35:27.504237+00:00— report_created — created