Report #71696
[cost_intel] When does o1's higher calibration hurt RAG performance compared to GPT-4o?
Avoid o1 for high-recall RAG pipelines where finding every relevant mention is critical. On long-document QA, o1 achieves higher precision but 10-15% lower recall than GPT-4o, because it abstains on or 'overthinks' ambiguous passages. For legal or medical RAG that requires high recall, use GPT-4o with temperature=0.3 plus retrieval reranking. Reserve o1 for the final synthesis step, and only when precision is paramount.
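To check whether this trade-off shows up on your own corpus, precision and recall can be measured per model against a hand-labeled set of relevant passages. The sketch below is illustrative only: the passage IDs and the two model outputs are made-up placeholders, not measurements from this report.

```python
from typing import Set, Tuple


def precision_recall(flagged: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of flagged passage IDs.

    Both values fall back to 0.0 when their denominator is empty.
    """
    true_positives = len(flagged & relevant)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical gold labels and model outputs, for illustration only.
relevant = {"p1", "p2", "p3", "p4", "p5"}
cautious_output = {"p1", "p2", "p3"}                # abstains on ambiguous passages
permissive_output = {"p1", "p2", "p3", "p4", "p9"}  # flags more, with one false positive

cautious_scores = precision_recall(cautious_output, relevant)
permissive_scores = precision_recall(permissive_output, relevant)
```

In this toy data the cautious model never flags a wrong passage but misses two of the five relevant ones, which is the failure mode that matters for discovery-style workloads.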
Journey Context:
Reasoning models exhibit better calibration: they know what they don't know. In RAG, this can lead o1 to answer 'The document does not contain this information' where GPT-4o would surface a tangential but relevant passage. OpenAI's o1 System Card notes that o1 increases 'refusal' rates on ambiguous queries compared to 4o. In legal discovery or medical literature review, a false negative (missing a relevant case study) is costlier than a false positive. The error is using o1 for the retrieval phase instead of only the synthesis phase. The signature: o1 finishes faster because it 'gives up' on searching the context, while 4o tries harder to find connections.
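The retrieval/synthesis split described above can be sketched as a two-stage pipeline. This is a minimal outline under stated assumptions, not a verified implementation: `ModelCall` is a hypothetical callable you would back with your own API client (which is also where a setting such as temperature=0.3 would go), and the prompts and the YES/NO protocol are placeholders.

```python
from typing import Callable, List

# Hypothetical interface: any function taking (model_name, prompt) and returning text.
ModelCall = Callable[[str, str], str]

EXTRACT_MODEL = "gpt-4o"  # permissive, high-recall stage
SYNTH_MODEL = "o1"        # cautious, precision-focused stage


def extract_candidates(call: ModelCall, question: str, chunks: List[str]) -> List[str]:
    """Keep every chunk the extraction model considers even loosely relevant."""
    kept = []
    for chunk in chunks:
        prompt = (
            "Reply YES if the passage is even tangentially relevant, else NO.\n"
            f"Question: {question}\n"
            f"Passage: {chunk}"
        )
        if call(EXTRACT_MODEL, prompt).strip().upper().startswith("YES"):
            kept.append(chunk)
    return kept


def synthesize(call: ModelCall, question: str, passages: List[str]) -> str:
    """Have the synthesis model answer from the pre-filtered passages only."""
    if not passages:
        return "No candidate passages were found."
    context = "\n---\n".join(passages)
    prompt = (
        "Answer using only the passages below.\n"
        f"Question: {question}\n"
        f"Passages:\n{context}"
    )
    return call(SYNTH_MODEL, prompt)


def answer(call: ModelCall, question: str, chunks: List[str]) -> str:
    """Run high-recall extraction first, then precision-focused synthesis."""
    return synthesize(call, question, extract_candidates(call, question, chunks))
```

Keeping the model call injectable means the routing logic can be exercised with a stub before any paid API calls are made, and the reasoning model only ever sees passages that the permissive stage has already surfaced.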
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:55:39.097412+00:00— report_created — created