Report #67886
[cost_intel] Cost crossover point where 100k context window stuffing beats RAG retrieval for Q&A tasks
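Use full context stuffing (no RAG) when the source material is <80k tokens and query volume is low, roughly under 17 requests/day with the figures below. Break-even analysis: a RAG pipeline (embedding + storage + retrieval) has a fixed infra cost of ~$200/mo. At GPT-4o pricing ($0.005/1k input), 80k tokens × 100 requests = $40/day = $1200/mo. RAG cuts the input to ~4k tokens of retrieved chunks: $2/day = $60/mo in tokens + $200 infra = $260/mo. So at 100 requests/day RAG is already cheaper. Stuffing costs $12/mo per daily request and RAG costs $200/mo plus $0.60/mo per daily request, which puts the crossover at about 17.5 requests/day. Below that, stuffing is cheaper and higher quality (no retrieval loss). Above 1k requests/day, RAG is effectively mandatory (~$12,000/mo stuffed vs. ~$800/mo with RAG). Critical: stuffing requires a 128k-context model; for smaller models, send ~4k tokens of retrieved chunks into an 8k context.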
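The break-even arithmetic can be checked with a small cost model. This is a minimal sketch that uses only the figures quoted in this report ($0.005/1k input tokens, 80k stuffed tokens, 4k retrieved tokens, ~$200/mo infra). The function names and the 30-day month are illustrative assumptions, and output-token cost is ignored.

```python
# Illustrative cost model; all constants come from the report's own figures.
PRICE_PER_1K_INPUT = 0.005   # USD per 1k input tokens, as quoted in the report
DAYS_PER_MONTH = 30          # assumption, matches $40/day -> $1200/mo


def monthly_token_cost(tokens_per_request, requests_per_day,
                       price_per_1k=PRICE_PER_1K_INPUT, days=DAYS_PER_MONTH):
    """Monthly input-token spend in USD for a fixed per-request token count."""
    return tokens_per_request / 1000 * price_per_1k * requests_per_day * days


def stuffing_cost(requests_per_day, context_tokens=80_000):
    """Monthly cost of sending the full source text on every request."""
    return monthly_token_cost(context_tokens, requests_per_day)


def rag_cost(requests_per_day, retrieved_tokens=4_000, infra_per_month=200.0):
    """Monthly cost of RAG: fixed infra plus tokens for the retrieved chunks."""
    return infra_per_month + monthly_token_cost(retrieved_tokens, requests_per_day)


def break_even_requests_per_day(context_tokens=80_000, retrieved_tokens=4_000,
                                infra_per_month=200.0,
                                price_per_1k=PRICE_PER_1K_INPUT,
                                days=DAYS_PER_MONTH):
    """Daily request volume at which stuffing and RAG cost the same."""
    saved_per_request = (context_tokens - retrieved_tokens) / 1000 * price_per_1k
    return infra_per_month / (saved_per_request * days)


if __name__ == "__main__":
    for volume in (10, 17, 18, 100, 1000):
        print(volume, round(stuffing_cost(volume), 2), round(rag_cost(volume), 2))
    print("break-even:", round(break_even_requests_per_day(), 1), "requests/day")
```

With these inputs the crossover lands between 17 and 18 requests/day. Changing `infra_per_month` or the token price moves it proportionally, so the threshold is worth recomputing for a given provider and vector-store bill.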
Journey Context:
Default engineering instinct: 'Use RAG for everything with documents.' This ignores the fixed cost of vector DBs (Pinecone/Weaviate) and embedding pipelines. For small, static contexts (company handbook, legal briefs <100 pages), stuffing the full text into a 128k context is simpler and cheaper at low volume. The quality cliff with RAG is retrieval failure (missing the relevant chunk because of a semantic mismatch in the embeddings); stuffing eliminates this. The cost of stuffing grows linearly with request volume: at high volume, you pay $0.005 per 1k tokens for the same context on every request. RAG sends only the retrieved chunks per request, so its fixed infra cost is amortized across requests. Hybrid approach: use RAG for initial filtering (recall), then stuff the top-5 chunks into context for precision (re-ranking).
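The hybrid step (rank chunks, keep the top 5, stuff them into the prompt) can be sketched in plain Python. This is a minimal sketch, assuming the chunks have already been embedded by whatever model is in use; `cosine`, `top_k_chunks`, and `build_prompt` are hypothetical helper names, and the prompt layout is an assumption, not a provider requirement.

```python
import math
from typing import List, Sequence, Tuple

# A chunk is (text, embedding). Embeddings are assumed to be precomputed.
Chunk = Tuple[str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: Sequence[float], chunks: List[Chunk], k: int = 5) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, selected: List[str]) -> str:
    """Concatenate the selected chunks and the question into one prompt string."""
    context = "\n\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    toy_chunks = [
        ("Vacation policy text", [1.0, 0.0]),
        ("Expense policy text", [0.0, 1.0]),
        ("General HR overview", [1.0, 1.0]),
    ]
    picked = top_k_chunks([1.0, 0.0], toy_chunks, k=2)
    print(build_prompt("How many vacation days do I get?", picked))
```

A brute-force scan like this is only reasonable for small corpora. It does show why a retrieval miss is the main quality risk: a chunk that is not ranked into the top k never reaches the model.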
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:25:52.967132+00:00— report_created — created