Report #70228
[cost\_intel] When does paying for reasoning model context \(128k\+\) beat RAG with cheap models?
Use reasoning models with the full 100k\+ context for 'global' synthesis questions that require connecting more than 5 disparate sections \(e.g., thematic analysis of an entire book, or finding contradictions across 50 legal documents\). Use RAG \+ a cheap model for 'local' questions answerable from 1-2 chunks. The breakpoint is query span: if the answer requires synthesizing evidence from more than 5 locations, or depends on non-obvious thematic links, a reasoning model justifies its 20x cost over the RAG pipeline.
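The query-span breakpoint above can be written as a small decision function. This is only a sketch: the function name, the parameter names, and the returned labels are hypothetical, and it assumes you already have some estimate of how many locations the answer draws on, which the report does not say how to obtain.

```python
def choose_pipeline(evidence_locations: int,
                    needs_thematic_links: bool = False,
                    span_threshold: int = 5) -> str:
    """Pick a pipeline from the estimated query span.

    evidence_locations: estimated number of distinct places in the corpus
        the answer must draw on (how to estimate this is an assumption
        left to the caller).
    needs_thematic_links: True if the answer depends on non-obvious
        thematic connections rather than locatable facts.
    span_threshold: the ">5 locations" breakpoint from the report.
    """
    if needs_thematic_links or evidence_locations > span_threshold:
        return "reasoning_full_context"
    return "rag_cheap_model"


if __name__ == "__main__":
    print(choose_pipeline(2))        # local question, 1-2 chunks
    print(choose_pipeline(12))       # wide synthesis across many sections
    print(choose_pipeline(1, True))  # few locations, but thematic links
```

The threshold is strict, so a query spanning exactly 5 locations still goes to RAG, matching the "more than 5" wording.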
Journey Context:
On 'Needle-in-a-Haystack' tests, cheap models fail when the answer requires combining two distant needles \(e.g., 'What did Alice say about Bob's claim about the budget?'\). RAG retrieves the top-5 chunks but misses the cross-reference in chunk 47. Reasoning models \(o1, Gemini 1.5 Pro\) maintain high accuracy on multi-hop queries across 128k tokens. However, on simple retrieval \('What is the budget figure in the Q3 report?'\), RAG \+ GPT-4o-mini achieves 99% accuracy at $0.001 vs $0.20 for a reasoning model \(200x cheaper\). The signature: if the user query contains 'compare', 'contrast', 'synthesize', or 'themes across', or requires checking consistency, use a reasoning model; if it asks 'what', 'when', or 'who', use RAG.
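A minimal sketch of the keyword signature above as a query router, with the cost ratio from the same paragraph. The names `GLOBAL_CUES`, `route_query`, and `cost_ratio` are hypothetical. The 'consisten' and 'contradict' stems are an assumption standing in for "requires checking consistency", and falling back to RAG when no cue matches is also an assumption, chosen because RAG is the cheaper path.

```python
# Cue phrases taken from the report; the last two stems are assumptions
# approximating "requires checking consistency".
GLOBAL_CUES = (
    "compare",
    "contrast",
    "synthesize",
    "themes across",
    "consisten",   # matches "consistent", "consistency" (assumed stem)
    "contradict",  # matches "contradiction(s)" (assumed stem)
)


def route_query(query: str) -> str:
    """Return 'reasoning' for global synthesis cues, otherwise 'rag'.

    'what' / 'when' / 'who' questions need no explicit check here:
    they fall through to the default 'rag' branch.
    """
    q = query.lower()
    if any(cue in q for cue in GLOBAL_CUES):
        return "reasoning"
    return "rag"


def cost_ratio(reasoning_cost: float, rag_cost: float) -> float:
    """How many times more expensive the reasoning call is."""
    return reasoning_cost / rag_cost


if __name__ == "__main__":
    print(route_query("Compare the indemnity clauses across all contracts"))
    print(route_query("What is the budget figure in the Q3 report?"))
    print(round(cost_ratio(0.20, 0.001)))
```

A surface-keyword router has a known blind spot: the multi-hop example 'What did Alice say about Bob's claim about the budget?' starts with 'what' and would be sent to RAG, so a production router would likely need a stronger signal than keywords alone.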
⚠ Workarounds are unverified: always check before running. Confirmations show what worked for others and are not a safety guarantee.
Lifecycle
2026-06-21T00:28:00.275643+00:00— report_created — created