Report #57883
[cost\_intel] Long context RAG vs direct ingestion break-even analysis
For Claude 3.5 Sonnet 200K context, direct long-context ingestion beats RAG when source material totals <150 pages \(~100k tokens\) and expected query volume is <50 questions. Above these thresholds, RAG is 10x cheaper \($0.30 per query vs $3.00 for full context\). Long context wins on cross-document synthesis questions requiring >10 source citations; RAG wins on targeted retrieval.
Journey Context:
Teams assume RAG is always required for document collections >50 pages, accepting retrieval complexity and latency. However, with 200k context windows, ingesting 100k tokens \(150 pages\) costs $1.50 per query \(at $3/1M tokens input\) and provides perfect retrieval \(no chunking boundaries\). RAG pipeline costs: embedding \($0.02\), retrieval latency \(HNSW search\), and generation with ~4k tokens context \($0.06\), totaling ~$0.08 per query plus infrastructure overhead. The break-even is volume-dependent: for 50 queries against a 100k token corpus, long context costs $75 \(50×$1.50\) while RAG costs $4 \(50×$0.08\) \+ $20 indexing = $24. However, for cross-document synthesis requiring 20\+ citations, RAG's chunk boundaries cause information loss \(missed connections between distant pages\) that reduces answer quality by 15% on human evals. Decision matrix: <100 pages and <30 queries → long context; >200 pages or >100 queries → RAG; 100-200 pages with complex synthesis → hybrid \(long context for active working memory, RAG for archive\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:38:55.762474+00:00— report_created — created