Report #36482
[cost_intel] Using reasoning models for retrieval-augmented generation (RAG) with simple lookup queries
Use embedding retrieval + cheap instruct models for factual Q&A on provided context; reserve reasoning models for synthesis requiring cross-document comparison, contradiction detection, or multi-step deduction from retrieved chunks.
Journey Context:
RAG systems that retrieve 1-3 chunks and return a direct answer see zero benefit from o1 over GPT-4o, but incur 20x the latency and 50x the cost. Failure mode: reasoning models 'hallucinate' reasoning steps even when the answer is verbatim in the context. Quality signature: check whether the task requires combining information from more than 3 separate sources or performing calculations on retrieved data. If it requires neither, a cheaper model suffices. Alternative: a hybrid approach, where a cheap model produces the answer and the reasoning model is used only as a judge/reranker.
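The quality signature and the hybrid alternative can be sketched as a small router. This is a minimal illustration, not a tested production pattern: the tier labels, the `Query` fields, `choose_tier`, and `answer_with_judge` are all hypothetical names, and it assumes some upstream step has already estimated how many sources a query needs and whether it involves calculation or cross-document comparison. The model callables are passed in as plain functions so that no particular vendor API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical tier labels; map them to whichever models you actually deploy.
CHEAP = "cheap-instruct"
REASONING = "reasoning"


@dataclass
class Query:
    text: str
    num_sources: int                 # distinct sources the answer must combine (assumed known upstream)
    needs_calculation: bool = False  # arithmetic over retrieved data
    needs_comparison: bool = False   # cross-document comparison or contradiction detection


def choose_tier(q: Query, max_simple_sources: int = 3) -> str:
    """Pick the reasoning tier only when the quality signature calls for it."""
    if q.num_sources > max_simple_sources:
        return REASONING
    if q.needs_calculation or q.needs_comparison:
        return REASONING
    return CHEAP


def answer_with_judge(
    q: Query,
    chunks: List[str],
    cheap_answer: Callable[[str, List[str]], str],
    judge: Callable[[str, List[str], str], bool],
) -> Tuple[str, bool]:
    """Hybrid path: the cheap model drafts, the reasoning model only judges."""
    draft = cheap_answer(q.text, chunks)
    accepted = judge(q.text, chunks, draft)
    return draft, accepted
```

A simple lookup such as `choose_tier(Query("What is the refund window?", num_sources=1))` stays on the cheap tier, while a query flagged with `needs_calculation=True` is escalated. How a rejected draft is handled in the hybrid path (retry, escalate, or flag for review) is left to the caller.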
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:42:29.286449+00:00— report_created — created