Report #94941
[cost_intel] Using reasoning models for all RAG queries, including single-hop factual lookups
Implement query classification: single-hop factual lookup (birth dates, definitions) → cheap embedding retrieval + GPT-4o-mini; multi-hop synthesis (comparing documents, inferring implicit connections) → reasoning model. Route based on retrieved chunk count and query complexity.
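A minimal sketch of such a router, under stated assumptions: the cue phrases, the `Route` type, and the `classify_query` function are hypothetical names invented for illustration, and "retrieved chunk count" is interpreted here as the number of distinct source documents among the relevant chunks. The 3-document threshold follows the "3+ documents" figure in this report; everything else should be tuned against real query logs.

```python
from dataclasses import dataclass

# Hypothetical cue phrases that hint at multi-hop synthesis.
# This list is an assumption for illustration, not a validated classifier.
MULTI_HOP_CUES = (
    "compare",
    "difference between",
    "versus",
    " vs ",
    " both ",
    "relationship between",
)


@dataclass
class Route:
    tier: str    # "cheap" or "reasoning"
    reason: str  # short explanation, useful for logging routing decisions


def classify_query(query: str, distinct_docs: int, doc_threshold: int = 3) -> Route:
    """Pick a model tier from the query wording and retrieval spread.

    distinct_docs: number of distinct documents the relevant chunks come from
    (an assumed proxy for how much cross-document synthesis is needed).
    """
    padded = f" {query.lower()} "  # pad so word-boundary cues like " vs " can match
    if distinct_docs >= doc_threshold:
        return Route("reasoning", f"answer spans {distinct_docs} documents")
    if any(cue in padded for cue in MULTI_HOP_CUES):
        return Route("reasoning", "query wording suggests synthesis")
    return Route("cheap", "single-hop lookup")


if __name__ == "__main__":
    print(classify_query("When was Ada Lovelace born?", distinct_docs=1))
    print(classify_query("Compare the refund policies in these contracts", distinct_docs=2))
```

The tier names are deliberately abstract; mapping "cheap" to GPT-4o-mini and "reasoning" to a specific reasoning model is left to the deployment. Logging the `reason` field makes it easier to audit misroutes later.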
Journey Context:
On HotpotQA (multi-hop), reasoning models improve F1 by 25-30% over GPT-4o, but on SQuAD (single-hop) the improvement is <2%, while reasoning-model cost is 15-20x higher. Quality degradation signature for cheap models: 'retrieval correct but synthesis hallucinated' when connecting >2 documents. The cliff is context connectivity: if the answer lives in one chunk and requires no inference, reasoning is wasted spend; if the answer requires comparing 3+ documents or bridging implicit connections, reasoning prevents hallucination.
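To make the cost side concrete, here is a rough back-of-the-envelope sketch. It assumes the cheap path costs 1 unit per query and the reasoning path costs 15-20 units (the multiplier quoted above), and it ignores the cost of the classifier and of retrieval itself. The function name `blended_cost_ratio` and the 20% multi-hop share are illustrative assumptions, not measurements from this report.

```python
def blended_cost_ratio(multi_hop_fraction: float, cost_multiplier: float) -> float:
    """Cost of routed traffic relative to sending every query to the reasoning model.

    multi_hop_fraction: share of queries routed to the reasoning model (0..1).
    cost_multiplier: reasoning-model cost relative to the cheap model (e.g. 15-20).
    """
    if not 0.0 <= multi_hop_fraction <= 1.0:
        raise ValueError("multi_hop_fraction must be between 0 and 1")
    if cost_multiplier <= 0:
        raise ValueError("cost_multiplier must be positive")
    # Cheap queries cost 1 unit each; reasoning queries cost cost_multiplier units.
    routed = multi_hop_fraction * cost_multiplier + (1.0 - multi_hop_fraction)
    all_reasoning = cost_multiplier
    return routed / all_reasoning


if __name__ == "__main__":
    # Assumed traffic mix: 20% multi-hop. Replace with your own measured share.
    for multiplier in (15, 20):
        ratio = blended_cost_ratio(0.2, multiplier)
        print(f"{multiplier}x multiplier: routed cost is {ratio:.0%} of all-reasoning cost")
```

Under that assumed 20% multi-hop share, routed traffic costs roughly a quarter of the all-reasoning baseline at either end of the 15-20x range; the saving shrinks as the multi-hop share grows.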
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:56:24.568629+00:00— report_created — created