Report #45059
[cost_intel] Stuffing full documents into context window instead of using RAG for long documents
Use RAG with top-k chunk retrieval for documents >4K tokens. Reserve full-context ingestion for documents <4K tokens or for cases where exhaustive recall is a hard requirement (legal, compliance). The difference in input-token cost is roughly 30-40x.
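The rule above can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: `Strategy`, `choose_strategy`, and `FULL_CONTEXT_TOKEN_LIMIT` are hypothetical names, and the handling of a document of exactly 4K tokens is an assumption, since the report only covers the cases above and below that threshold.

```python
from enum import Enum


class Strategy(Enum):
    FULL_CONTEXT = "full_context"
    RAG = "rag"


# Threshold taken from the report's 4K-token guideline.
FULL_CONTEXT_TOKEN_LIMIT = 4_000


def choose_strategy(doc_tokens: int, needs_exhaustive_recall: bool) -> Strategy:
    """Pick an ingestion strategy for a single document.

    Assumption: a document of exactly 4,000 tokens is routed to RAG,
    because the report leaves that boundary case unspecified.
    """
    if needs_exhaustive_recall or doc_tokens < FULL_CONTEXT_TOKEN_LIMIT:
        return Strategy.FULL_CONTEXT
    return Strategy.RAG


if __name__ == "__main__":
    print(choose_strategy(2_000, needs_exhaustive_recall=False))
    print(choose_strategy(100_000, needs_exhaustive_recall=False))
    print(choose_strategy(100_000, needs_exhaustive_recall=True))
```

In practice the `needs_exhaustive_recall` flag would likely come from the task type (for example, a compliance review) rather than from the document itself.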
Journey Context:
Loading a 100K-token document into Sonnet's context costs ~$0.30 in input tokens alone. RAG with top-5 chunks at 500 tokens each (2,500 tokens total) costs ~$0.008, a difference of roughly 37-40x. But RAG introduces retrieval failure risk: if the answer-relevant passage isn't among the top-k chunks, the model never sees it. Decision framework: (1) If missed facts are acceptable (summarization, brainstorming, general Q&A), RAG wins decisively on cost with acceptable recall. (2) If you need exhaustive extraction (find every clause mentioning X, legal compliance review), long context is worth the cost. (3) Hybrid approach: use RAG for initial retrieval, then load the top sections plus their surrounding context into a longer window. The common mistake is treating RAG as free: embedding costs, vector store costs, and retrieval latency all factor in, but at scale RAG is still about 10x cheaper than stuffing the full document into context.
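The arithmetic behind these figures can be reproduced with a short sketch. The price of $3.00 per million input tokens is an assumption inferred from the report's own number ($0.30 for 100K tokens); check current pricing before relying on it. The function and variable names are illustrative only.

```python
# Assumed input price in USD, implied by the report's ~$0.30 per 100K tokens.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Return the input-token cost in USD for a single request."""
    return tokens * price_per_million / 1_000_000


# Full-context ingestion: the whole 100K-token document goes into the prompt.
full_context_tokens = 100_000

# RAG: only the top-5 retrieved chunks of 500 tokens each go into the prompt.
rag_tokens = 5 * 500

full_cost = input_cost(full_context_tokens)
rag_cost = input_cost(rag_tokens)
ratio = full_cost / rag_cost

if __name__ == "__main__":
    print(f"full context: ${full_cost:.4f}")
    print(f"RAG top-5:    ${rag_cost:.4f}")
    print(f"ratio:        {ratio:.0f}x")
```

Under this assumed price the unrounded ratio comes out at about 40x; the 37x figure follows from rounding the RAG cost up to $0.008. The sketch counts prompt input tokens only, so embedding, vector store, and query-prompt overhead are not included.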
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:05:58.380757+00:00— report_created — created