Report #68899
[cost_intel] When does o1 beat GPT-4o on 100+ page document analysis, and when is the latency cost prohibitive?
Use o1 for extraction that requires cross-document reasoning (conflicting information across pages, temporal logic, causal chains). Use GPT-4o with chunking/RAG for simple entity extraction or single-page fields. When running o1 on long documents, plan for asynchronous processing, since latency exceeds 20s.
Journey Context:
On the 'Needle in a Haystack' test and legal document Q&A benchmarks, o1-preview shows 35% higher accuracy than GPT-4o on questions that require synthesis across more than 10 distinct locations in a 128k context. The trade-off is cost and latency: processing 100k tokens with o1 costs ~$6 (input) vs. $0.50 for GPT-4o, roughly a 12x difference, and latency exceeds 20s because of hidden reasoning tokens. The specific degradation signature: GPT-4o suffers from 'lost in the middle' failures on multi-hop reasoning across pages (e.g., 'compare clause 3 on page 5 with clause 8 on page 89'), while o1 maintains the logical chain. For simple key-value extraction (invoice numbers, dates) within a single page, o1 adds cost without benefit. The architectural pattern: use a cheap model for initial chunking and extraction, and use o1 only on merged results that show conflicts or require logical reconciliation.
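The escalation step in that pattern can be sketched as below. This is a minimal illustration, not a tested pipeline: the `Extraction` record, `find_conflicts`, and `plan_escalation` are hypothetical names, and "conflict" is assumed to mean that the cheap model returned more than one distinct value for the same field on different pages. The actual model calls are left out; only the routing decision is shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One field value pulled from one page by the cheap model (hypothetical shape)."""
    field: str
    value: str
    page: int


def find_conflicts(extractions):
    """Return {field: sorted pages} for fields with more than one distinct value."""
    values = defaultdict(set)
    pages = defaultdict(list)
    for e in extractions:
        # Normalise lightly so "30 Days" and "30 days" do not count as a conflict.
        values[e.field].add(e.value.strip().lower())
        pages[e.field].append(e.page)
    return {f: sorted(pages[f]) for f, vals in values.items() if len(vals) > 1}


def plan_escalation(extractions):
    """Split fields into those accepted as-is and those sent to the reasoning model."""
    conflicts = find_conflicts(extractions)
    accepted = sorted({e.field for e in extractions} - set(conflicts))
    return accepted, conflicts


# Example: two pages disagree on the notice period, so only that field escalates.
results = [
    Extraction("termination_notice", "30 days", 5),
    Extraction("termination_notice", "60 days", 89),
    Extraction("invoice_number", "INV-1042", 2),
]
accepted, escalate = plan_escalation(results)
```

Here `accepted` holds the fields that can be stored directly from the cheap pass, and `escalate` maps each conflicting field to the pages that would be passed to o1 for reconciliation. A real implementation would likely need a stricter notion of conflict (for example, date or currency normalisation) than the lowercase comparison assumed here.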
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:07:47.430076+00:00— report_created — created