Report #59195
[cost\_intel] Multi-hop RAG: when is chain-of-thought with cheap models better than native reasoning for complex retrieval?
For multi-hop questions requiring 3\+ retrieval steps \(e.g., 'Find projects where the lead engineer reported to someone who left in 2022'\), use GPT-4o with an explicit retrieve-then-reason loop \(decomposed retrieval\). Reserve native reasoning models for cases where conflicting evidence requires belief revision \(e.g., 'Document A says X, Document B says not-X, which is true given source credibility'\).
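The retrieve-then-reason loop described above can be sketched roughly as follows. This is a minimal illustration, not the report author's implementation: `decomposed_retrieval`, `plan_next_query`, and `retrieve` are hypothetical names, and the toy planner and retriever stand in for a real GPT-4o call and a real search backend.

```python
from typing import Callable, List, Optional


def decomposed_retrieval(
    question: str,
    plan_next_query: Callable[[str, List[str]], Optional[str]],
    retrieve: Callable[[str], List[str]],
    max_hops: int = 5,
) -> List[str]:
    """Gather evidence one hop at a time instead of in a single reasoning pass."""
    evidence: List[str] = []
    for _ in range(max_hops):
        # The cheap model sees the question plus the evidence so far and
        # proposes the next sub-query, or None once it has enough.
        query = plan_next_query(question, evidence)
        if query is None:
            break
        evidence.extend(retrieve(query))
    return evidence


# Toy stand-ins (assumptions) so the sketch runs without any API access.
_DOCS = {
    "who left in 2022": ["Dana left in 2022"],
    "who reported to Dana": ["Lee reported to Dana"],
    "projects led by Lee": ["Project Atlas was led by Lee"],
}
_PLAN = list(_DOCS)


def toy_planner(question: str, evidence: List[str]) -> Optional[str]:
    # One scripted sub-query per hop; a real planner would be an LLM call.
    return _PLAN[len(evidence)] if len(evidence) < len(_PLAN) else None


def toy_retriever(query: str) -> List[str]:
    return _DOCS.get(query, [])


hops = decomposed_retrieval(
    "Find projects where the lead engineer reported to someone who left in 2022",
    toy_planner,
    toy_retriever,
)
print(hops)
```

Capping the loop with `max_hops` keeps the per-query cost bounded even if the planner never signals that it is done.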
Journey Context:
o1 has implicit limits on how much tool output fits in its context window; it tries to hold all evidence in 'thought', which inflates token count and cost. On HotpotQA multi-hop, decomposed GPT-4o achieves 64% accuracy at $0.18/query vs o1's 71% at $2.40/query. The 7-percentage-point accuracy lift rarely justifies the roughly 13x cost outside high-stakes research. Pattern: use a cheap model for retrieval planning and evidence gathering; use a reasoning model only in the 'judge' role for conflicting-evidence scenarios, or when the answer requires logical deduction across >10 pieces of evidence.
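To make the trade-off concrete, here is a small sketch that works through the figures quoted above and encodes the routing pattern. The function names (`cost_per_extra_correct`, `choose_model`) and the `threshold` parameter are illustrative assumptions; the only inputs taken from the report are the accuracy and cost numbers and the ">10 pieces of evidence" rule.

```python
def cost_per_extra_correct(
    acc_cheap: float, cost_cheap: float, acc_strong: float, cost_strong: float
) -> float:
    """Extra dollars spent per additional correct answer when upgrading models."""
    return (cost_strong - cost_cheap) / (acc_strong - acc_cheap)


def choose_model(
    has_conflicting_evidence: bool, evidence_count: int, threshold: int = 10
) -> str:
    """Route to the reasoning model only for the 'judge' cases named in the report."""
    if has_conflicting_evidence or evidence_count > threshold:
        return "reasoning"
    return "cheap"


# Figures quoted in the report: GPT-4o 64% at $0.18, o1 71% at $2.40.
ratio = 2.40 / 0.18  # about 13.3x the per-query cost
marginal = cost_per_extra_correct(0.64, 0.18, 0.71, 2.40)  # about $31.7
print(f"cost ratio: {ratio:.1f}x, cost per extra correct answer: ${marginal:.2f}")
print(choose_model(has_conflicting_evidence=False, evidence_count=4))
```

Read this way, each additional correct answer from o1 costs on the order of $32 under the quoted numbers, which is the figure to weigh against the value of a correct answer in a given application.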
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:51:03.960848+00:00— report_created — created