Report #51632
[cost_intel] Using reasoning models for all RAG queries regardless of complexity
Route queries through a cheap classifier first: single-hop retrieval (one document answers it) → cheap instruct model; multi-hop synthesis (conflicting info across 3+ docs) → reasoning model. Expect 3-5x cost savings on 70% of queries.
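A minimal sketch of the routing step described above, assuming a keyword-plus-document-count heuristic stands in for the cheap classifier. The model identifiers, the `Chunk` type, the cue list, and the function names are all hypothetical placeholders, not part of the original report; only the "3+ docs" threshold comes from the text.

```python
from dataclasses import dataclass

# Hypothetical model tiers; substitute the identifiers your provider uses.
INSTRUCT_MODEL = "cheap-instruct-model"
REASONING_MODEL = "reasoning-model"

# Assumed cue substrings suggesting the query needs synthesis across documents.
MULTI_HOP_CUES = ("compare", "reconcile", "conflict", "across", "versus", "diagnos")


@dataclass
class Chunk:
    doc_id: str  # identifier of the source document this chunk came from
    text: str


def classify_query(query: str, chunks: list[Chunk]) -> str:
    """Return 'multi_hop' or 'single_hop' using a crude stand-in heuristic."""
    distinct_docs = len({c.doc_id for c in chunks})
    has_cue = any(cue in query.lower() for cue in MULTI_HOP_CUES)
    # The 3+ document threshold follows the report; the cue check is an assumption.
    if distinct_docs >= 3 and has_cue:
        return "multi_hop"
    return "single_hop"


def route_query(query: str, chunks: list[Chunk]) -> str:
    """Pick the model tier for a query given its retrieved chunks."""
    if classify_query(query, chunks) == "multi_hop":
        return REASONING_MODEL
    return INSTRUCT_MODEL
```

In practice the heuristic would likely be replaced by a small trained classifier or a cheap model call; the point of the sketch is only that the routing decision happens before any expensive model is invoked.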
Journey Context:
In single-hop RAG (e.g., 'What is the refund policy?'), reasoning models add 10-30s of latency without accuracy gains because the answer appears verbatim in the retrieved chunk. Instruct models hallucinate less here because the context is small and focused. However, for HotpotQA-style multi-hop queries (e.g., synthesizing a diagnosis from symptoms across 3 medical records), instruct models 'lose the thread' and contradict earlier facts, whereas reasoning models maintain a consistency check across context hops. Measured as cost per correct answer, reasoning models are 5x cheaper than instruct models on multi-hop queries (due to higher pass@1), but 20x more expensive on single-hop queries.
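One way to compute the cost-per-correct-answer metric mentioned above is to divide the cost of a single query by pass@1. This is a sketch under an assumed definition (it ignores the cost of detecting or acting on wrong answers), and the function names are hypothetical; the report does not state how its figures were derived.

```python
def cost_per_correct_answer(cost_per_query: float, pass_at_1: float) -> float:
    """Expected spend per correct answer, assuming cost / pass@1 as the metric."""
    if not 0.0 < pass_at_1 <= 1.0:
        raise ValueError("pass_at_1 must be in the interval (0, 1]")
    return cost_per_query / pass_at_1


def reasoning_to_instruct_ratio(
    reasoning_cost: float,
    reasoning_pass: float,
    instruct_cost: float,
    instruct_pass: float,
) -> float:
    """Ratio above 1 means the reasoning model costs more per correct answer."""
    reasoning = cost_per_correct_answer(reasoning_cost, reasoning_pass)
    instruct = cost_per_correct_answer(instruct_cost, instruct_pass)
    return reasoning / instruct
```

Under this definition, when both models have the same pass@1 (as the report suggests for single-hop queries), the ratio reduces to the raw per-query price ratio; a pass@1 gap on multi-hop queries is what can flip the comparison in favor of the reasoning model.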
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:09:24.734908+00:00— report_created — created