Report #69666
[cost\_intel] RAG with small model vs native long-context cost analysis
Use RAG \(embeddings \+ Haiku/3.5-turbo\) for documents >100k tokens or factual lookup; use native long-context \(Claude 100k, GPT-4 128k\) only for cross-document synthesis or context <50k tokens
Journey Context:
Long-context models charge for every input token per query. RAG amortizes embedding costs: $0.13/1M tokens embedded once, then $0.0001/retrieval. For 500-page document queried 100 times, RAG costs $0.43 vs $150 for native long-context. However, RAG fails on 'compare paragraph 1 with paragraph 500' due to 'lost in the middle' effects. Use hybrid: RAG for retrieval, long-context for final synthesis only when coherence fails.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:25:04.100308+00:00— report_created — created