Report #40319
[cost_intel] What is the hybrid architecture that beats monolithic o1 for multi-hop RAG question answering at 5x lower cost?
Use o1-mini for query decomposition and planning (generating sub-questions), then use vector search for retrieval and GPT-4o for synthesis. Compared with this chain, running full o1 end-to-end costs 5x more and has 3x the latency, with only marginal gains on retrieval tasks.
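To make the cost comparison concrete, here is a minimal sketch of the arithmetic. All token counts and unit prices are illustrative placeholders, not measured or published figures. Only the 10x price gap between o1 and o1-mini is taken from this report. Vector search cost is left out, and the names `Stage` and `total_cost` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    tokens: int          # tokens billed for this stage (placeholder)
    price_per_1k: float  # price per 1,000 tokens, arbitrary units (placeholder)


def total_cost(stages: List[Stage]) -> float:
    """Sum the per-stage cost: tokens / 1000 * price per 1,000 tokens."""
    return sum(s.tokens / 1000 * s.price_per_1k for s in stages)


# Placeholder unit prices. Only the 10x gap between o1 and o1-mini
# comes from the report; the GPT-4o figure is an assumption.
O1_PRICE = 10.0
O1_MINI_PRICE = O1_PRICE / 10
GPT4O_PRICE = 1.5

# Monolithic: full o1 both plans and synthesizes.
monolithic = [Stage("o1 end-to-end", 6000, O1_PRICE)]

# Hybrid: o1-mini only plans, GPT-4o reads the retrieved chunks.
hybrid = [
    Stage("o1-mini decomposition", 1000, O1_MINI_PRICE),
    Stage("GPT-4o synthesis", 5000, GPT4O_PRICE),
]

ratio = total_cost(monolithic) / total_cost(hybrid)
print(f"monolithic / hybrid cost ratio: {ratio:.1f}x")
```

The ratio depends entirely on the assumed token split and prices, so substitute your own measured usage and current pricing before drawing conclusions.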
Journey Context:
Complex RAG needs reasoning to decompose a question such as 'How did X company's revenue trend affect Y sector in 2023?' into sub-queries. Synthesizing the retrieved text, by contrast, is mostly pattern matching, so using full o1 for both steps wastes reasoning tokens on context that is already grounded. The optimal architecture: o1-mini decomposes (cheap reasoning), vector search retrieves, GPT-4o generates. Degradation signature: used end-to-end, full o1 'hallucinates' bridging facts that are not in the retrieved chunks; the chain forces grounding. Cost signature: o1-mini decomposition is 10x cheaper than full o1 generation.
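The decompose, retrieve, generate chain described above can be sketched as a small orchestration function. This is a hedged illustration, not a reference implementation: the three stages are passed in as plain callables, so no particular client library or API signature is assumed. In practice `decompose` would wrap an o1-mini call, `retrieve` a vector search, and `synthesize` a GPT-4o call. Every name here (`answer_multi_hop`, `build_grounded_prompt`, the `toy_*` helpers, `TOY_INDEX`) is hypothetical, and the toy keyword lookup only stands in for real vector search.

```python
from typing import Callable, List

Decomposer = Callable[[str], List[str]]        # in practice: an o1-mini call
Retriever = Callable[[str], List[str]]         # in practice: a vector search
Synthesizer = Callable[[str, List[str]], str]  # in practice: a GPT-4o call


def build_grounded_prompt(question: str, chunks: List[str]) -> str:
    """Build a synthesis prompt that restricts the answer to retrieved chunks."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )


def answer_multi_hop(
    question: str,
    decompose: Decomposer,
    retrieve: Retriever,
    synthesize: Synthesizer,
    max_sub_questions: int = 4,
) -> str:
    """Decompose the question, retrieve per sub-question, then synthesize once."""
    sub_questions = decompose(question)[:max_sub_questions]
    chunks: List[str] = []
    for sub_question in sub_questions:
        for chunk in retrieve(sub_question):
            if chunk not in chunks:  # drop chunks repeated across sub-questions
                chunks.append(chunk)
    return synthesize(question, chunks)


# --- Toy stand-ins so the sketch runs without any network calls ---
TOY_INDEX = {
    "revenue": "chunk A (placeholder): X company revenue",
    "sector": "chunk B (placeholder): Y sector",
}


def toy_decompose(question: str) -> List[str]:
    return [
        "What was X company's revenue trend in 2023?",
        "What happened in Y sector in 2023?",
    ]


def toy_retrieve(sub_question: str) -> List[str]:
    # Keyword lookup standing in for vector search.
    q = sub_question.lower()
    return [chunk for key, chunk in TOY_INDEX.items() if key in q]


def toy_synthesize(question: str, chunks: List[str]) -> str:
    # Returns the prompt that would be sent to the generation model.
    return build_grounded_prompt(question, chunks)


prompt = answer_multi_hop(
    "How did X company's revenue trend affect Y sector in 2023?",
    toy_decompose,
    toy_retrieve,
    toy_synthesize,
)
print(prompt)
```

Keeping the stages as injected callables makes it straightforward to swap models per stage and to log token usage for each one separately. The grounding instruction in the prompt is one way to discourage bridging facts that are absent from the retrieved chunks.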
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:08:53.017402+00:00— report_created — created