Report #40660
[cost_intel] In legal document processing, when does the 30x cost of reasoning models actually improve extraction accuracy over fine-tuned instruct models?
Use reasoning models only for contractual interpretation that requires cross-referencing more than 3 sections or resolving ambiguity. For entity extraction (dates, parties, amounts) and clause classification, a fine-tuned GPT-3.5-turbo or Haiku with regex validation achieves 98%+ precision at 1/30th the cost, and reasoning models show no improvement on structured extraction.
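The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Task` fields, the tier labels, and the regex patterns are all hypothetical, and the rule encodes one reading of the recommendation (reasoning tier only for interpretation tasks that span more than 3 sections or involve ambiguity).

```python
import re
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever models you actually deploy.
INSTRUCT = "fine_tuned_instruct"
REASONING = "reasoning"


@dataclass
class Task:
    # Assumed task kinds: "entity_extraction", "clause_classification", "interpretation".
    kind: str
    sections_referenced: int = 1
    ambiguous: bool = False


def choose_tier(task: Task) -> str:
    """Send only multi-section or ambiguous interpretation to the reasoning tier."""
    if task.kind == "interpretation" and (task.sections_referenced > 3 or task.ambiguous):
        return REASONING
    return INSTRUCT


# Illustrative validators for extracted fields; real contracts need broader patterns.
VALIDATORS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"(?:USD|EUR|GBP) \d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}


def is_valid(field: str, value: str) -> bool:
    """Return True only if a validator exists for the field and the whole value matches."""
    pattern = VALIDATORS.get(field)
    return bool(pattern and pattern.fullmatch(value))
```

Extractions that fail `is_valid` could be retried or queued for human review rather than escalated to the reasoning tier by default, since the claim here is that reasoning models do not improve structured extraction.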
Journey Context:
Legal tech vendors upsell 'AI reasoning' for all document review. But the LegalBench figures cited here show diminishing returns on legal NER: GPT-4 achieves 85% F1, GPT-3.5 achieves 83%, and o1 achieves 86%, a 3-point gain over GPT-3.5 that is not worth a 30x cost. The break-even is 'implied term analysis', for example determining whether a force majeure clause covers pandemics based on drafting history and jurisdictional precedent. That is reasoning territory; extracting 'Clause 5.2: Term is 12 months' is not.
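To make the diminishing-returns argument concrete, here is a rough arithmetic sketch using the F1 figures quoted above and the report's headline 30x cost multiple. The function name and the "extra cost per F1 point" metric are assumptions introduced for illustration; they are not part of any benchmark.

```python
def extra_cost_per_f1_point(base_f1: float, new_f1: float, cost_multiple: float) -> float:
    """Extra spend, in units of the baseline per-document cost, per F1 point gained."""
    gain = new_f1 - base_f1
    if gain <= 0:
        # No accuracy gain: any extra spend buys nothing.
        return float("inf")
    return (cost_multiple - 1) / gain


# GPT-3.5 at 83% F1 (baseline cost 1x) versus o1 at 86% F1 (assumed 30x cost).
ratio = extra_cost_per_f1_point(83, 86, 30)
print(round(ratio, 1))  # 9.7
```

Under these assumptions, each additional F1 point on NER costs roughly 9.7 times the baseline per-document spend, which is the sense in which the gain is not worth paying for on structured extraction.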
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:43:10.062371+00:00— report_created — created