Report #77415
[cost_intel] Where does Gemini 1.5 Flash hit a quality cliff in long-context RAG despite its 1M-token window?
Use Flash for single-hop retrieval within 128k tokens of context; switch to Pro for multi-hop reasoning or for needle-in-haystack retrieval beyond 200k tokens.
Journey Context:
Flash is tuned for low latency and low cost ($0.35 per 1M input tokens vs $3.50 per 1M for Pro). It accepts a 1M-token context window, but its attention degrades on tasks that require correlating distant tokens (e.g., 'summarize this 500-page contract and find the contradiction on page 400'). Quality signature: Flash invents details or misses distant dependencies (needle-in-haystack recall drops below 60% at 500k tokens of context vs above 90% for Pro), while Pro stays coherent out to 1M tokens. The cost differential is 10x, so use Flash for single-hop lookups ('find the phone number') and reserve Pro for multi-hop work ('analyze the narrative arc').
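The routing rule and the cost gap can be sketched in a few lines of Python. This is a minimal illustration and not a vetted implementation: the function names `pick_model` and `input_cost_usd`, the constants, and the `gap_model` parameter are hypothetical. The prices and thresholds are the ones quoted in this report and may not match current pricing. The report does not say what to do with single-hop queries between 128k and 200k tokens, so the sketch assumes a conservative default of Pro for that range.

```python
# Prices per 1M input tokens, as quoted in this report (may be outdated).
PRICE_PER_M_INPUT_USD = {"flash": 0.35, "pro": 3.50}

# Thresholds taken from the report's recommendation.
FLASH_SAFE_TOKENS = 128_000    # single-hop retrieval is reported fine up to here
PRO_REQUIRED_TOKENS = 200_000  # needle-in-haystack beyond this goes to Pro


def pick_model(context_tokens: int, multi_hop: bool, gap_model: str = "pro") -> str:
    """Return "flash" or "pro" for a request.

    gap_model covers the 128k-200k single-hop range, which the report
    leaves unspecified; defaulting to "pro" is an assumption.
    """
    if multi_hop or context_tokens > PRO_REQUIRED_TOKENS:
        return "pro"
    if context_tokens <= FLASH_SAFE_TOKENS:
        return "flash"
    return gap_model


def input_cost_usd(model: str, context_tokens: int) -> float:
    """Input-token cost in USD at the prices quoted above."""
    return PRICE_PER_M_INPUT_USD[model] * context_tokens / 1_000_000


if __name__ == "__main__":
    for tokens, multi_hop in [(50_000, False), (50_000, True), (500_000, False)]:
        model = pick_model(tokens, multi_hop)
        print(tokens, multi_hop, model, round(input_cost_usd(model, tokens), 4))
```

At the quoted prices, a 500k-token prompt costs about $0.175 in input tokens on Flash and $1.75 on Pro, which is the 10x gap the router trades against the reported recall drop.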
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:32:25.997799+00:00 - report_created - created