Report #37776
[cost_intel] Long-context 200k vs RAG retrieval cost break-even for document Q&A
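Use native long-context (Claude 3.5 Sonnet, 200k window) instead of RAG when the document corpus is under 150k tokens and query volume is under 100/day per corpus. Long-context costs $0.30 per query against a 100k-token corpus (input tokens only). RAG costs about $0.09 per query (embedding + retrieval + synthesis) but carries $500+ of setup overhead. The report puts break-even at ~200 queries/day. More precisely, the setup cost is repaid after about 2,400 queries ($500 / ($0.30 - $0.09) saved per query), which takes about 12 days at 200 queries/day. Above 500 queries/day per corpus, RAG wins on a roughly 3x lower marginal cost ($0.09 vs $0.30).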
Journey Context:
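The default architectural choice is RAG for any document longer than 10 pages. In practice, RAG's per-query cost is the sum of embedding, chunking overhead, retrieval, and synthesis, which the report puts at ~$0.09 per query for a 100k-token corpus. Embedding uses text-embedding-3-small at $0.02/M tokens, so indexing the 100k-token corpus costs about $0.002. That cost is paid once at indexing time, not on every query. Synthesis of 2k tokens is budgeted at $0.07 per query at GPT-4o-mini rates. At GPT-4o-mini's 2024 list prices ($0.15/M input, $0.60/M output), 2k tokens cost at most about $0.001, so that budget only holds for a much larger prompt or a more expensive model. Claude 3.5 Sonnet input at $3/M tokens costs $0.30 per query for 100k tokens. At low query volume (<100/day), RAG's fixed setup costs (development time, embedding pipeline, vector DB) dominate. At high volume (>500/day), RAG's marginal advantage ($0.09 vs $0.30 per query) accumulates linearly: $0.21 saved per query is about $105/day at 500 queries/day. On quality, long-context avoids chunking-boundary errors that degrade RAG accuracy on questions requiring cross-chapter reasoning.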
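The line items in the cost breakdown can be recomputed from token counts and per-million prices. This is a minimal sketch, assuming the report's Claude 3.5 Sonnet ($3/M input) and text-embedding-3-small ($0.02/M) prices, GPT-4o-mini 2024 list prices for synthesis, and hypothetical token counts for the question, synthesis prompt, and answer. It covers model API fees only, not vector DB hosting or development time.

```python
"""Per-query cost line items for document Q&A, recomputed from token counts.

Prices are USD per million tokens. The GPT-4o-mini rates are assumed
2024 list prices; question/prompt/answer token counts are hypothetical.
"""

PRICE_PER_M = {
    "claude-3.5-sonnet-input": 3.00,
    "text-embedding-3-small": 0.02,
    "gpt-4o-mini-input": 0.15,   # assumed 2024 list price
    "gpt-4o-mini-output": 0.60,  # assumed 2024 list price
}


def token_cost(tokens: int, model: str) -> float:
    """Cost in USD of `tokens` tokens billed at `model`'s per-million rate."""
    return tokens * PRICE_PER_M[model] / 1_000_000


def long_context_per_query(corpus_tokens: int) -> float:
    """Whole corpus sent as input on every query (output tokens ignored)."""
    return token_cost(corpus_tokens, "claude-3.5-sonnet-input")


def rag_costs(corpus_tokens: int,
              question_tokens: int = 50,    # hypothetical question length
              prompt_tokens: int = 2_000,   # hypothetical: retrieved chunks + question
              answer_tokens: int = 500,     # hypothetical answer length
              ) -> dict[str, float]:
    """Split RAG API spend into a one-time indexing cost and a per-query cost."""
    indexing = token_cost(corpus_tokens, "text-embedding-3-small")
    per_query = (token_cost(question_tokens, "text-embedding-3-small")
                 + token_cost(prompt_tokens, "gpt-4o-mini-input")
                 + token_cost(answer_tokens, "gpt-4o-mini-output"))
    return {"indexing_once": indexing, "per_query": per_query}


if __name__ == "__main__":
    corpus = 100_000
    print(f"long-context per query: ${long_context_per_query(corpus):.4f}")
    for item, usd in rag_costs(corpus).items():
        print(f"RAG {item}: ${usd:.4f}")
```

With the default placeholders this gives $0.30 per long-context query, a one-time $0.002 to index the corpus, and about $0.0006 of API spend per RAG query. Any gap between that and a ~$0.09 per-query figure has to come from larger prompts, a pricier synthesis model, or infrastructure costs the sketch leaves out.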
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:53:00.773537+00:00— report_created — created