Report #92268
[cost\_intel] Assuming reasoning models better utilize long context than instruct models
Use GPT-4o with 128k context for document QA; o1 doesn't reduce hallucination on long docs and costs 5x more
Journey Context:
On 'Needle in a Haystack' tests and real-world document QA \(100k\+ token contexts\), o1 shows retrieval accuracy similar to GPT-4o's, at 5x the cost and with significantly higher latency. The reasoning process does not help with literal retrieval or simple extraction \(e.g. 'what is the termination clause date?'\). Common error: upgrading to o1 for 'better understanding' of long contracts. Hallucination rates on extraction tasks are similar for both models \(~3-5%\). Where o1 helps: synthesis across multiple long documents \(e.g. comparing clause A in doc 1 to clause B in doc 2\). For single-doc QA, the extra cost buys nothing.
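The routing rule described above can be sketched as a small helper. This is a minimal illustration, not a tested production router: the task labels, the `QARequest` type, and the function names are hypothetical, and the only number used is the roughly 5x cost ratio stated in this report (costs are in relative units, not real prices).

```python
from dataclasses import dataclass

# Assumption: the ~5x ratio is taken from this report; absolute prices are omitted on purpose.
GPT4O_RELATIVE_COST = 1.0
O1_RELATIVE_COST = 5.0

# Hypothetical task labels for work that needs reasoning across documents.
SYNTHESIS_TASKS = {"compare", "synthesize"}


@dataclass(frozen=True)
class QARequest:
    task: str           # e.g. "extract", "retrieve", "compare", "synthesize"
    num_documents: int  # number of long documents in the prompt


def choose_model(request: QARequest) -> str:
    """Route to o1 only for synthesis across multiple documents."""
    if request.num_documents > 1 and request.task in SYNTHESIS_TASKS:
        return "o1"
    # Single-doc QA and literal extraction stay on the cheaper model.
    return "gpt-4o"


def relative_cost(model: str, num_requests: int) -> float:
    """Cost in relative units, where one GPT-4o request costs 1.0."""
    per_request = O1_RELATIVE_COST if model == "o1" else GPT4O_RELATIVE_COST
    return per_request * num_requests


if __name__ == "__main__":
    single_doc = QARequest(task="extract", num_documents=1)
    cross_doc = QARequest(task="compare", num_documents=2)
    print(choose_model(single_doc))
    print(choose_model(cross_doc))
    print(relative_cost("o1", 100) / relative_cost("gpt-4o", 100))
```

Keeping the routing decision in one function makes it easy to audit which requests pay the higher rate, and to adjust the rule if your own measurements differ from the figures reported here.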
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:27:49.139283+00:00— report_created — created