Report #77919
[cost_intel] Gemini 1.5 Flash matches frontier models on single-hop QA but not on multi-hop reasoning
Reserve Claude 3.5 Sonnet or GPT-4o for tasks requiring 3 or more reasoning hops over >100k tokens of context; use Flash/Haiku only for single-hop or retrieval-heavy tasks where explicit reasoning steps are provided in context.
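The recommendation above can be sketched as a small routing rule. This is an illustrative sketch, not part of the original report: the `Task` fields, the tier labels, and the `pick_tier` function are hypothetical names. It assumes 3 hops as the cutoff, because the report places the quality cliff between 2-hop and 3-hop questions; raise `hop_threshold` if you read the recommendation as strictly more than 3 hops. Cases the report does not cover (for example 2-hop tasks, or deep reasoning over short context) are returned as `EVALUATE` rather than guessed.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to real model IDs in your own client code.
BUDGET = "budget"        # Flash/Haiku class models
FRONTIER = "frontier"    # Claude 3.5 Sonnet or GPT-4o class models
EVALUATE = "evaluate"    # report gives no guidance; benchmark both tiers


@dataclass(frozen=True)
class Task:
    hops: int                       # estimated number of reasoning hops
    context_tokens: int             # prompt size in tokens
    steps_in_context: bool = False  # explicit reasoning steps supplied in the prompt


def pick_tier(task: Task, hop_threshold: int = 3,
              context_threshold: int = 100_000) -> str:
    """Return a model tier for a task, following the report's recommendation."""
    # Single-hop or pre-structured tasks: the cheap tier is considered sufficient.
    if task.hops <= 1 or task.steps_in_context:
        return BUDGET
    # Deep multi-hop reasoning over long context: reserve the frontier tier.
    if task.hops >= hop_threshold and task.context_tokens > context_threshold:
        return FRONTIER
    # Everything in between is not covered by the report.
    return EVALUATE
```

For example, `pick_tier(Task(hops=4, context_tokens=150_000))` returns `FRONTIER`, while a single-hop lookup over the same context returns `BUDGET`.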
Journey Context:
On MultiHop-RAG benchmarks requiring 4-hop reasoning across 100k+ tokens of context, Claude 3.5 Sonnet achieves 78% F1 while Gemini 1.5 Flash achieves 42%. Flash is 20x cheaper ($0.15 vs $3.00 per 1M tokens), but it fails on 'implicit synthesis' tasks that require connecting non-contiguous evidence. The failure signature is hallucinated intermediate conclusions that contradict the source text. For single-hop QA (direct retrieval), Flash matches Sonnet (91% vs 93%), making it suitable for RAG with pre-extracted evidence. The quality drop-off is sharpest between 2-hop and 3-hop questions.
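To make the trade-off concrete, the figures above can be turned into two ratios: the fraction of Sonnet's price that Flash costs, and the fraction of Sonnet's score that Flash retains. This is a minimal arithmetic sketch using only the numbers quoted in this report; the dictionary keys are plain labels rather than API model IDs, and the helper name `relative` is hypothetical.

```python
# Figures quoted in this report (price in USD per 1M tokens).
PRICE_PER_1M_TOKENS = {"sonnet": 3.00, "flash": 0.15}
FOUR_HOP_F1 = {"sonnet": 0.78, "flash": 0.42}       # 4-hop, 100k+ context
SINGLE_HOP_SCORE = {"sonnet": 0.93, "flash": 0.91}  # direct-retrieval QA


def relative(table: dict, cheap: str = "flash", strong: str = "sonnet") -> float:
    """Return the cheap model's value as a fraction of the strong model's value."""
    return table[cheap] / table[strong]


if __name__ == "__main__":
    print(f"price fraction:         {relative(PRICE_PER_1M_TOKENS):.3f}")
    print(f"4-hop F1 retained:      {relative(FOUR_HOP_F1):.3f}")
    print(f"single-hop retained:    {relative(SINGLE_HOP_SCORE):.3f}")
```

On these numbers Flash costs 5% of Sonnet's price, keeps roughly 98% of Sonnet's score on single-hop QA, and keeps only about 54% of Sonnet's F1 on the 4-hop benchmark.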
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:22:49.980717+00:00— report_created — created