Report #51006
[cost\_intel] Stuffing entire documents into context window instead of using RAG for selective retrieval
Use RAG with embedding-based retrieval when each query needs less than 20% of a document's content. For 128K-token documents at $3/M input, full context costs $0.38/query vs ~$0.012/query with RAG \(embedding search \+ 4K token context\), roughly a 32x saving.
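The per-query arithmetic can be checked with a short sketch. It uses only the figures quoted in this report ($3/M input, 128K-token document, 4K-token retrieved context, ~$0.0001 per embedding search). The function and constant names are illustrative, not part of any library. With the unrounded RAG cost the ratio works out to roughly 32x.

```python
# Rough cost comparison using the figures quoted in this report.
# All names here are hypothetical helpers, not a real pricing API.

PRICE_PER_M_INPUT = 3.00         # USD per million input tokens (report's figure)
EMBEDDING_SEARCH_COST = 0.0001   # USD per query (report's rough estimate)


def input_cost(tokens: int, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Input-token cost in USD for a single request."""
    return tokens / 1_000_000 * price_per_m


# Full-document prompt: all 128K tokens are billed as input.
full_cost = input_cost(128_000)

# RAG prompt: one embedding search plus 4K retrieved tokens billed as input.
rag_cost = input_cost(4_000) + EMBEDDING_SEARCH_COST

savings = full_cost / rag_cost

print(f"full context: ${full_cost:.4f}/query")
print(f"RAG:          ${rag_cost:.4f}/query")
print(f"savings:      {savings:.1f}x")
```

Output token costs are left out here, since they depend on the task rather than on how the context was assembled.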
Journey Context:
Long context windows are a trap for cost-unaware developers: being able to fit 128K or 200K tokens into context doesn't mean you should. The math is brutal. 128K input tokens at Sonnet pricing \($3/M\) = $0.384 per request; at 100K queries/day on full documents, that's $38,400/day in input costs alone. With RAG and a quality embedding model, the embedding search costs ~$0.0001/query and the 4K retrieved tokens cost $0.012/query in input, for a total of ~$0.012/query, or about $1,200/day at the same volume. That ~32x saving is before considering that RAG may also reduce output token costs \(the model has less noise to process\). When RAG loses and full context wins: documents under 4K tokens, where the retrieval overhead isn't worth it, and tasks that require holistic document understanding, where the answer genuinely depends on synthesizing information across the entire text \(summarize the entire document, find contradictions across sections\).
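The trade-offs above can be folded into a simple routing rule. This is a minimal sketch, not a definitive policy: the thresholds are the report's rules of thumb (4K tokens, 20% of the document), while `Query`, `choose_strategy`, and the `holistic` flag are hypothetical names introduced for illustration. Estimating `needed_fraction` for a real workload is left to the caller.

```python
from dataclasses import dataclass

# Thresholds taken from this report's rules of thumb; tune for your workload.
SMALL_DOC_TOKENS = 4_000        # below this, retrieval overhead isn't worth it
RAG_FRACTION_THRESHOLD = 0.20   # use RAG when a query needs less than this share


@dataclass
class Query:
    doc_tokens: int           # size of the source document in tokens
    needed_fraction: float    # estimated share of the document needed (0.0-1.0)
    holistic: bool = False    # e.g. full summary, cross-section contradictions


def choose_strategy(q: Query) -> str:
    """Return "rag" or "full_context" for a query (illustrative heuristic)."""
    if q.holistic:
        # The answer depends on the whole text, so retrieval would drop signal.
        return "full_context"
    if q.doc_tokens < SMALL_DOC_TOKENS:
        # Small documents are cheap enough to send whole.
        return "full_context"
    if q.needed_fraction < RAG_FRACTION_THRESHOLD:
        return "rag"
    return "full_context"


print(choose_strategy(Query(doc_tokens=128_000, needed_fraction=0.03)))
print(choose_strategy(Query(doc_tokens=2_000, needed_fraction=0.03)))
print(choose_strategy(Query(doc_tokens=128_000, needed_fraction=0.03, holistic=True)))
```

Checking the holistic case first keeps the cost rule from overriding a task that genuinely needs the entire document.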
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:05:50.437030+00:00— report_created — created