Report #75802
[cost_intel] Using GPT-4o for all RAG Q&A, including simple 'lookup' queries that don't require synthesis, causing 5x cost overhead
Implement a query router in your RAG pipeline. Use embedding retrieval + Haiku/GPT-4o-mini for 'lookup' questions (factual retrieval from a single chunk, no comparison) and route only 'synthesis' questions (comparing multiple docs, handling contradictions, 'why' questions) to GPT-4o/Claude Sonnet. Haiku handles 85% of lookup queries at 95% accuracy, at $0.80 per 1M tokens vs $15 per 1M tokens for Sonnet; overall RAG costs drop 70% with <2% accuracy loss.
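To sanity-check the savings figure against your own traffic, the arithmetic can be sketched as below. This is a rough illustration, not a billing model: it assumes every query consumes the same number of tokens and treats the two quoted prices as a single blended per-1M-token rate for each model. The function names are hypothetical.

```python
def blended_cost_per_million(small_share: float,
                             small_price: float = 0.80,
                             large_price: float = 15.00) -> float:
    """Average price per 1M tokens when `small_share` of traffic uses the small model.

    Assumption: token usage per query is the same on both routes.
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return small_share * small_price + (1.0 - small_share) * large_price


def savings_fraction(small_share: float,
                     small_price: float = 0.80,
                     large_price: float = 15.00) -> float:
    """Fractional cost reduction relative to sending everything to the large model."""
    blended = blended_cost_per_million(small_share, small_price, large_price)
    return 1.0 - blended / large_price


def share_needed_for_savings(target_savings: float,
                             small_price: float = 0.80,
                             large_price: float = 15.00) -> float:
    """Share of traffic that must go to the small model to reach `target_savings`."""
    return target_savings / (1.0 - small_price / large_price)


if __name__ == "__main__":
    share = share_needed_for_savings(0.70)
    print(f"small-model share needed for 70% savings: {share:.1%}")
    print(f"savings at that share: {savings_fraction(share):.1%}")
```

Under these simplifying assumptions, a 70% cost reduction corresponds to routing roughly three quarters of traffic to the small model. Measure your own lookup share before relying on that number.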
Journey Context:
Most RAG pipelines use one model for all queries. A simple lookup ('What is the refund policy?') doesn't need generative reasoning; embedding retrieval + a small model is enough to answer it. The signal that distinguishes 'lookup' from 'synthesis' is whether the answer requires combining more than one retrieved chunk non-trivially. Haiku fails on synthesis (it confabulates connections), while Sonnet excels. The router can be a tiny BERT classifier ($0.0001/query) or regex heuristics that match keywords such as 'compare', 'difference', 'why'.
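The regex-heuristic variant of the router could look something like the sketch below. It is a minimal illustration under stated assumptions: the keyword list starts from the three terms mentioned above and is widened to cover simple inflections ('comparing', 'differs'), and the model identifiers are placeholders to be replaced with whatever your provider SDK expects. All names here are hypothetical.

```python
import re
from dataclasses import dataclass

# Keywords that suggest the answer must combine several chunks.
# Assumption: matching inflections ("comparing", "differs") is desirable.
SYNTHESIS_PATTERN = re.compile(r"\b(?:compar\w*|differ\w*|why)\b", re.IGNORECASE)

# Placeholder identifiers, not real API model names.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"


@dataclass(frozen=True)
class Route:
    label: str  # "lookup" or "synthesis"
    model: str  # model identifier to call for this query


def route_query(question: str) -> Route:
    """Classify a question with keyword heuristics and pick a model tier."""
    if SYNTHESIS_PATTERN.search(question):
        return Route(label="synthesis", model=LARGE_MODEL)
    return Route(label="lookup", model=SMALL_MODEL)


if __name__ == "__main__":
    for q in ("What is the refund policy?",
              "Why did the refund policy change between 2023 and 2024?"):
        route = route_query(q)
        print(f"{route.label:<9} -> {route.model}: {q}")
```

Keyword matching is cheap but will misroute synthesis questions that avoid these words, so it is worth logging routed queries and spot-checking the 'lookup' bucket before trusting it. A trained classifier can replace `route_query` later without changing the callers.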
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:49:41.909810+00:00— report_created — created