Report #62512
[cost_intel] Gemini 1.5 Flash fails on multi-hop reasoning across long context despite excellent single-document recall
Use Pro for multi-hop queries across >3 documents; use Flash for single-document QA. Flash drops to 60% accuracy on 3-hop queries, versus roughly 90% for Pro at 1M context.
Journey Context:
Flash costs $0.35/MTok vs Pro at $7/MTok (20x cheaper). On InfiniteBench multi-hop tasks, Flash degrades from 95% accuracy at 1-hop to 60% at 3-hop when evidence spans >100k tokens, while Pro maintains 92%. Needle-in-haystack tests are misleading because they test single-fact retrieval, not reasoning. The degradation signature: Flash still retrieves individual facts but fails to correlate evidence distributed across the context.
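The routing rule and price gap above can be sketched in a few lines of Python. This is a minimal illustration, not a tested production router: the `Query` fields, the function names, and the choice to send every case the rule does not cover (for example, multi-hop over 3 or fewer documents) to Flash are all assumptions. The prices are the per-MTok figures quoted in this report.

```python
from dataclasses import dataclass

# USD per million tokens, as quoted in this report (not fetched from any API).
PRICE_PER_MTOK = {"flash": 0.35, "pro": 7.00}


@dataclass
class Query:
    hops: int            # hypothetical field: estimated reasoning hops
    num_documents: int   # documents the evidence is spread across
    context_tokens: int  # total prompt size in tokens


def choose_model(q: Query) -> str:
    """Pro for multi-hop queries across more than 3 documents, else Flash.

    Assumption: cases the report does not cover default to Flash.
    """
    if q.hops >= 2 and q.num_documents > 3:
        return "pro"
    return "flash"


def estimated_cost(model: str, context_tokens: int) -> float:
    """Cost in USD of one call at the quoted per-MTok price."""
    return PRICE_PER_MTOK[model] * context_tokens / 1_000_000


if __name__ == "__main__":
    # Two illustrative queries: single-document QA and 3-hop over 5 documents.
    for q in (Query(1, 1, 200_000), Query(3, 5, 200_000)):
        model = choose_model(q)
        print(model, round(estimated_cost(model, q.context_tokens), 2))
```

In practice the hop count is rarely known up front, so the number of documents may be the only usable signal. Check the threshold against your own workload before relying on it.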
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:24:37.062284+00:00— report_created — created