Report #52557
[cost\_intel] Using cheap models for multi-hop QA requiring synthesis across documents
Use o1 for HotpotQA-style multi-hop questions requiring synthesis across >3 documents; use RAG\+4o-mini only for single-hop fact retrieval \(SQuAD-style\).
Journey Context:
RAG pipelines often fail on multi-hop questions \(e.g., 'When was the director of the movie starring X born?', which requires the chain movie→director→birthdate\). 4o-mini achieves ~85% on SQuAD \(single-hop\) but drops to 40% on HotpotQA hard \(multi-hop\). o1 achieves ~75% on HotpotQA hard because it performs implicit chain-of-thought across retrieved chunks. Cost per query: $0.001 \(4o-mini RAG\) vs $0.15 \(o1\), a 150x difference. The breakpoint is 'connective reasoning': if the answer requires comparing quantities across sources, temporal reasoning, or causal chains across >2 documents, pay for o1. If the answer is 'find X in doc Y', cheap RAG suffices.
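The routing rule above can be sketched as a small Python function. This is a minimal illustration, not a tested production router: `QueryProfile`, `choose_model`, `workload_cost`, and the model labels are hypothetical names introduced here, and the sketch assumes the query has already been classified by some upstream step that the report does not describe. It also assumes one reading of the rule, namely that the ">2 documents" threshold applies to causal chains while quantity comparison and temporal reasoning trigger o1 on their own. The per-query costs are the approximate figures quoted in the report.

```python
from dataclasses import dataclass
from typing import Iterable

# Approximate per-query costs in USD, as quoted in the report (not re-measured).
COST_PER_QUERY = {"4o-mini-rag": 0.001, "o1": 0.15}


@dataclass
class QueryProfile:
    """Hypothetical description of a query; all fields are illustrative."""
    documents_needed: int = 1          # distinct sources the answer draws on
    compares_quantities: bool = False  # e.g. "which of A and B is larger?"
    temporal_reasoning: bool = False   # ordering or dating events across sources
    causal_chain: bool = False         # e.g. movie -> director -> birthdate


def needs_connective_reasoning(q: QueryProfile) -> bool:
    """Assumed reading of the report's 'connective reasoning' breakpoint."""
    long_chain = q.causal_chain and q.documents_needed > 2
    return q.compares_quantities or q.temporal_reasoning or long_chain


def choose_model(q: QueryProfile) -> str:
    """Pay for o1 only when connective reasoning is needed."""
    return "o1" if needs_connective_reasoning(q) else "4o-mini-rag"


def workload_cost(queries: Iterable[QueryProfile]) -> float:
    """Total cost in USD of a batch under this routing rule."""
    return sum(COST_PER_QUERY[choose_model(q)] for q in queries)


if __name__ == "__main__":
    single_hop = QueryProfile(documents_needed=1)
    multi_hop = QueryProfile(documents_needed=3, causal_chain=True)
    batch = [single_hop] * 9 + [multi_hop]
    print(choose_model(single_hop), choose_model(multi_hop))
    print(f"batch cost: ${workload_cost(batch):.3f}")
```

With these figures, a single o1 call costs as much as 150 cheap RAG calls, so the mix of query types dominates the bill: in the batch above, one multi-hop query accounts for most of the total.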
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:42:40.110303+00:00— report_created — created