Report #77942
[cost_intel] Long-context document QA: when to use reasoning models versus RAG with cheap models
Use reasoning models (Gemini 1.5 Pro, Claude 3 Opus, o1) for 'needle-in-haystack' queries that require synthesis across >100 pages or implicit connections between passages. Use GPT-4o-mini + RAG to retrieve specific facts from known sections or by explicit keyword match.
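The rule above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the `Query` fields, the `choose_pipeline` name, and the returned labels are hypothetical, and the thresholds (>100 pages, >5 evidence locations) are simply the ones quoted in this report. How those fields get estimated for a real query is left open.

```python
from dataclasses import dataclass

# Hypothetical pipeline labels, used only for this illustration.
REASONING = "long_context_reasoning"   # e.g. Gemini 1.5 Pro / Claude 3 Opus / o1
CHEAP_RAG = "rag_cheap"                # e.g. GPT-4o-mini + RAG


@dataclass
class Query:
    pages_spanned: int              # estimated pages the answer draws on
    evidence_locations: int         # distinct places evidence is expected
    needs_implicit_inference: bool  # answer is not stated explicitly anywhere


def choose_pipeline(q: Query) -> str:
    """Route a query using the thresholds quoted in the report."""
    if q.needs_implicit_inference:
        return REASONING
    if q.evidence_locations > 5:
        return REASONING
    # Synthesis across >100 pages implies more than one evidence location.
    if q.pages_spanned > 100 and q.evidence_locations > 1:
        return REASONING
    # Specific fact from a known section or an explicit keyword match.
    return CHEAP_RAG
```

A single-fact lookup such as `choose_pipeline(Query(1, 1, False))` routes to the cheap RAG path, while any query flagged as needing implicit inference routes to the reasoning model regardless of size.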
Journey Context:
Gemini 1.5 Pro (2M-token context) and o1 can reason over contexts of 100K+ tokens.
Cost: ~$3.50 per 100K input tokens for reasoning models vs. $0.20 for GPT-4o-mini.
Quality gap: on 'needle in haystack' tasks (finding one fact in 500 pages), cheap models given the full context fail 30-40% of the time because of lost-in-the-middle bias, while reasoning models maintain a 95%+ success rate. With RAG, good chunking and embeddings, and clean documents, however, GPT-4o-mini reaches 90%+ at roughly 1/20th of the cost.
The cliff: cheap-model RAG breaks down when evidence is scattered (e.g., 'Summarize contradictions between sections A and F'), when the answer requires reasoning across >5 locations, or when it depends on implicit inference (e.g., 'Is this contract clause compliant with regulation X based on scattered definitions?').
Degradation signature: the cheap model retrieves the relevant chunks but fails to connect them, or it synthesizes contradictory information without flagging the conflict.
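The cost figures above can be turned into a quick back-of-the-envelope comparison. The sketch below uses only the report's own per-100K-token estimates, which are not verified provider prices. The function names, the 100K-token document size, and the 65% accuracy figure (the midpoint implied by a 30-40% failure rate) are assumptions made for illustration. Note that the raw price ratio works out to about 17.5x, so the "roughly 1/20th" figure presumably also reflects RAG sending fewer tokens per query.

```python
# Figures below are the report's estimates, not verified provider pricing.
REASONING_USD_PER_100K = 3.50   # reasoning model, per 100K input tokens
CHEAP_USD_PER_100K = 0.20       # GPT-4o-mini, per 100K input tokens


def input_cost(tokens: int, usd_per_100k: float) -> float:
    """Input-token cost in USD for a single query."""
    return tokens / 100_000 * usd_per_100k


def cost_per_correct_answer(cost: float, accuracy: float) -> float:
    """Expected spend per correct answer, assuming independent retries."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost / accuracy


if __name__ == "__main__":
    doc_tokens = 100_000  # hypothetical document size
    reasoning = input_cost(doc_tokens, REASONING_USD_PER_100K)
    cheap = input_cost(doc_tokens, CHEAP_USD_PER_100K)
    print(f"price ratio: {reasoning / cheap:.1f}x")
    # 0.95 is the report's figure; 0.65 is an assumed midpoint of 60-70%.
    print(f"reasoning: ${cost_per_correct_answer(reasoning, 0.95):.2f} per correct answer")
    print(f"cheap, full context: ${cost_per_correct_answer(cheap, 0.65):.2f} per correct answer")
```

Dividing cost by accuracy is a deliberately crude model: it ignores output tokens and the cost of detecting a wrong answer, which matters most in the scattered-evidence cases where the cheap model fails silently.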
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:25:41.937360+00:00— report_created — created