Report #78396
[cost\_intel] Long context multi-document reasoning: o1 vs GPT-4o accuracy
Use o1/o3 for multi-document synthesis over 100k tokens when the documents have complex cross-document dependencies \(legal contracts, research synthesis\). GPT-4o tends to miss cross-references and 'lose the thread' even though the input fits in its context window.
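The routing rule above could be sketched as follows. This is a minimal illustration, not a tested policy: `SynthesisJob`, `pick_model`, and the `LONG_CONTEXT_TOKENS` constant are hypothetical names, and the returned strings are model family labels rather than exact API model IDs.

```python
from dataclasses import dataclass

# Assumption: threshold taken from the ">100k tokens" guidance in this report.
LONG_CONTEXT_TOKENS = 100_000


@dataclass
class SynthesisJob:
    total_tokens: int             # combined size of all documents in the prompt
    cross_doc_dependencies: bool  # True if answers depend on linking several documents


def pick_model(job: SynthesisJob) -> str:
    """Return an illustrative model family label for a synthesis job."""
    # Pay for the reasoning model only when both conditions from the report hold.
    if job.total_tokens > LONG_CONTEXT_TOKENS and job.cross_doc_dependencies:
        return "o1"
    return "gpt-4o"
```

For example, `pick_model(SynthesisJob(150_000, True))` selects the reasoning model, while a 150k-token job without cross-document dependencies stays on GPT-4o.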
Journey Context:
GPT-4o's 128k context window is shallow in practice: performance degrades on 'needle in a haystack' tasks that require multiple hops. o1 maintains higher accuracy on long-context reasoning \(e.g., legal contract comparison across 50 docs\). o1 costs about 20x as much as GPT-4o, which is justified when missing a cross-reference is expensive. The degradation signature: GPT-4o hallucinates connections or misses contradictions between page 50 and page 5, while o1 maintains the reasoning chain across the full context.
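The cost trade-off can be framed as an expected-cost comparison. The sketch below is illustrative only: the function names are hypothetical, and the miss rates, call cost, and penalty in the usage note are made-up inputs, not measurements. Only the 20x multiplier comes from this report.

```python
def expected_cost(call_cost: float, miss_rate: float, miss_penalty: float) -> float:
    """Cost of one run plus the expected cost of a missed cross-reference."""
    return call_cost + miss_rate * miss_penalty


def reasoning_model_pays_off(
    base_call_cost: float,       # cost of one GPT-4o run (assumed input)
    cost_multiplier: float,      # 20.0 per this report
    base_miss_rate: float,       # assumed chance GPT-4o misses a cross-reference
    reasoning_miss_rate: float,  # assumed chance o1 misses it
    miss_penalty: float,         # business cost of one missed cross-reference
) -> bool:
    """True when the pricier model has the lower expected total cost."""
    cheap = expected_cost(base_call_cost, base_miss_rate, miss_penalty)
    pricey = expected_cost(
        base_call_cost * cost_multiplier, reasoning_miss_rate, miss_penalty
    )
    return pricey < cheap
```

With hypothetical inputs of a 0.50 base call cost, miss rates of 0.10 versus 0.02, and a 5000 penalty per miss, the reasoning model comes out ahead; drop the penalty to 50 and the 20x premium no longer pays for itself.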
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:10:59.914175+00:00— report_created — created