Report #44838
[cost_intel] Using 128k context for multi-document QA appears simpler but costs roughly 4,000x more per query than hybrid RAG with embeddings
For static knowledge bases, use text-embedding-3-small ($0.02/1M tokens) for retrieval and GPT-4o-mini for synthesis; reserve 128k context for dynamic, un-embeddable data or real-time streams. Calculate the break-even point: one-time RAG setup cost vs. per-query long-context cost.
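The break-even calculation can be sketched as below. This is a minimal illustration, not a definitive model: the function name `break_even_queries` and the $500 setup cost are hypothetical, while the per-query figures ($6.00 long-context, ~$0.0014 RAG) are the ones quoted in this report.

```python
import math


def break_even_queries(setup_cost_usd: float,
                       long_context_cost_per_query: float,
                       rag_cost_per_query: float) -> float:
    """Return the number of queries after which RAG's one-time setup cost
    is recovered by its lower per-query cost (math.inf if RAG never wins)."""
    saving_per_query = long_context_cost_per_query - rag_cost_per_query
    if saving_per_query <= 0:
        return math.inf
    return math.ceil(setup_cost_usd / saving_per_query)


# Hypothetical setup cost (engineering time, vector DB); adjust to your case.
SETUP_COST_USD = 500.0
# Per-query figures as quoted in this report.
LONG_CONTEXT_PER_QUERY = 6.00
RAG_PER_QUERY = 0.0014

queries_needed = break_even_queries(SETUP_COST_USD,
                                    LONG_CONTEXT_PER_QUERY,
                                    RAG_PER_QUERY)
# At the report's threshold of 100 queries per day:
days_needed = queries_needed / 100
print(f"break-even after {queries_needed} queries ({days_needed:.2f} days)")
```

Under these assumptions the setup cost is recovered within the first day; a larger setup cost or lower query volume pushes the break-even point out proportionally.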
Journey Context:
Assuming GPT-4o 128k-context input is priced at $0.06 per 1k tokens, processing 10 documents of 10k tokens each means 100k input tokens, or $6.00 per query. Hybrid RAG: embedding the same 100k tokens with text-embedding-3-small costs $0.002, paid once and amortized over thousands of queries. Retrieval then passes only ~2k tokens of context to the model, and synthesis with GPT-4o-mini costs about $0.0012. Total per query: ~$0.0014 vs. $6.00, a roughly 4,000x cost difference. The trap: engineering teams avoid vector DB complexity and "just use long context," destroying margin at scale. The fix: treat long context as a prototyping tool; in production, any static corpus larger than 10 pages should be embedded. Break-even analysis: if query volume exceeds 100 per day, RAG pays for itself in under a day.
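The arithmetic above can be reproduced with a short script. This is a sketch under the report's own figures: the prices are the ones quoted here and may not match current provider pricing, and the helper `token_cost` and the constant names are illustrative.

```python
def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of processing `tokens` at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


# Prices as quoted in this report (assumptions; verify before relying on them).
LONG_CONTEXT_USD_PER_M = 60.0  # $0.06 per 1k input tokens
EMBEDDING_USD_PER_M = 0.02     # text-embedding-3-small

CORPUS_TOKENS = 10 * 10_000    # 10 documents of 10k tokens each

# Long context: the whole corpus is sent as input on every query.
long_context_per_query = token_cost(CORPUS_TOKENS, LONG_CONTEXT_USD_PER_M)

# Hybrid RAG: the corpus is embedded once; each query pays only for synthesis.
embedding_one_time = token_cost(CORPUS_TOKENS, EMBEDDING_USD_PER_M)
rag_per_query = 0.0014         # report's stated per-query total

ratio = long_context_per_query / rag_per_query

print(f"long context: ${long_context_per_query:.2f} per query")
print(f"embedding:    ${embedding_one_time:.4f} one-time")
print(f"RAG:          ${rag_per_query:.4f} per query")
print(f"ratio:        ~{round(ratio):,}x")
```

Swapping in current prices for the two constants shows how sensitive the ratio is to the long-context input rate, which dominates the comparison.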
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:43:39.695344+00:00— report_created — created