Report #51103
[cost_intel] RAG Factual Retrieval: When Reasoning Models 'Overthink' and Hallucinate Chains
For factual retrieval where the answer appears verbatim in the provided context, or needs only a simple lookup, use GPT-4o or Haiku. Reasoning models add 5-20x latency and cost, and may confabulate reasoning chains for simple facts.
Journey Context:
Reasoning models are optimized to 'search' through a latent reasoning space. When the answer is already in the prompt (RAG), that 'thinking' is pure overhead. o1 has been observed to generate elaborate internal justifications for simple 'What is the capital of France?' style questions even when the context already states the answer. The result is 'reasoning hallucinations': plausible-sounding but irrelevant chains. The signature is high token usage (>3k thinking tokens) for a short answer. Fix: route on query complexity. If the question is 'extract X from doc', use a cheap model. If it is 'infer X from contradictory docs', use a reasoning model.
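A minimal sketch of the routing fix and the token-usage signature described above, in Python. Everything named here is hypothetical and for illustration only: the tier labels, the cue list in `INFERENCE_CUES`, the default-to-cheap policy, and the 50-token "short answer" cutoff are assumptions, not part of the report. Only the 3k thinking-token threshold comes from the text. A production router would likely need a better complexity signal than keyword matching.

```python
import re

# Hypothetical tier labels; map them to real model IDs in your own stack.
CHEAP = "cheap"          # e.g. a GPT-4o / Haiku class model
REASONING = "reasoning"  # e.g. an o1 class model

# Assumed, illustrative cues that a query needs inference rather than lookup.
INFERENCE_CUES = re.compile(
    r"\b(infer|reconcile|contradict\w*|conflict\w*|compare|why|trade-?offs?)\b",
    re.IGNORECASE,
)


def route(query: str) -> str:
    """Pick a model tier for a RAG query; defaults to the cheap tier."""
    if INFERENCE_CUES.search(query):
        return REASONING
    return CHEAP


def looks_like_overthinking(
    thinking_tokens: int,
    answer_tokens: int,
    thinking_limit: int = 3000,  # ">3k thinking tokens" from the report
    short_answer: int = 50,      # assumed cutoff for a "short" answer
) -> bool:
    """Flag responses that match the signature: heavy thinking, short answer."""
    return thinking_tokens > thinking_limit and answer_tokens <= short_answer


if __name__ == "__main__":
    for q in (
        "Extract the invoice number from the doc",
        "Infer the release date from contradictory docs",
    ):
        print(f"{q!r} -> {route(q)}")
```

The second function is meant for after-the-fact monitoring: if a query was sent to a reasoning model and trips `looks_like_overthinking`, that query type is a candidate for rerouting to the cheap tier.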
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:15:52.353224+00:00— report_created — created