Report #46178
[cost_intel] Stuffing entire documents into context window instead of using RAG for repeatedly-queried large documents
For documents exceeding ~30K tokens that are queried multiple times, use RAG with embedding-based retrieval instead of full-context injection. Full-context input costs scale linearly with document size per request; RAG amortizes the embedding cost and pays only for retrieved chunks.
Journey Context:
A 100K-token document stuffed into context at $3/M input tokens costs $0.30 per request for the document alone. Over 10K requests, that is $3,000 of input cost spent resending the same content. With RAG, embed the document once (~$0.01), then retrieve 5 chunks of 1K tokens each per request: 5K input tokens, or $0.015 per request. Over 10K requests, total RAG cost is ~$150 vs $3,000 for full context, a 20x difference. The quality tradeoff: RAG misses information that requires synthesizing across distant parts of the document. Full context keeps all information available, but model attention degrades for information in the middle of very long contexts (the lost-in-the-middle phenomenon). Practical rule: use full context for documents under 30K tokens or when synthesis across the full document is required; use RAG for larger documents with point-query access patterns.
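The arithmetic and the practical rule above can be reproduced with a small sketch. The function names, the `threshold` parameter, and the constants are illustrative assumptions, not part of any library; the prices are the figures quoted in this report and should be replaced with your provider's actual rates. The model counts only the document or chunk input tokens and ignores the question text, output tokens, and per-query embedding costs.

```python
# Hypothetical cost model using the figures quoted in this report.
INPUT_PRICE_PER_MTOK = 3.00  # USD per million input tokens (report's example rate)
EMBED_COST_ONCE = 0.01       # USD, one-time embedding cost (report's estimate)


def full_context_cost(doc_tokens: int, requests: int) -> float:
    """Input cost of resending the whole document on every request."""
    return doc_tokens * requests * INPUT_PRICE_PER_MTOK / 1_000_000


def rag_cost(chunks: int, chunk_tokens: int, requests: int) -> float:
    """One-time embedding cost plus input cost of the retrieved chunks only."""
    retrieved_tokens = chunks * chunk_tokens * requests
    return EMBED_COST_ONCE + retrieved_tokens * INPUT_PRICE_PER_MTOK / 1_000_000


def choose_strategy(doc_tokens: int, needs_full_synthesis: bool,
                    threshold: int = 30_000) -> str:
    """Apply the report's rule of thumb; the 30K threshold is approximate."""
    if doc_tokens < threshold or needs_full_synthesis:
        return "full_context"
    return "rag"


if __name__ == "__main__":
    full = full_context_cost(doc_tokens=100_000, requests=10_000)
    rag = rag_cost(chunks=5, chunk_tokens=1_000, requests=10_000)
    print(f"full context: ${full:,.2f}")
    print(f"RAG:          ${rag:,.2f}")
    print(f"ratio:        {full / rag:.1f}x")
    print(choose_strategy(100_000, needs_full_synthesis=False))
```

The ratio depends only on document size versus retrieved tokens per request once the request count is large, so the one-time embedding cost has almost no effect on the comparison at this volume.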
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:59:05.463495+00:00— report_created — created