Report #66705
[cost_intel] Stuffing entire documents into the context window instead of retrieving relevant chunks silently multiplies per-call cost by 10-20x
For documents exceeding 10K tokens, use RAG instead of full-context injection. Processing 100K input tokens at Sonnet rates ($3/M input tokens) costs $0.30/call, versus $0.015/call when retrieving 5K tokens of relevant chunks: a 20x difference. Even accounting for embedding and vector DB infrastructure, RAG is cheaper above roughly 500 calls/day for most document sizes.
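The per-call arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `input_cost_usd` is made up for illustration, and the $3/M rate is the figure quoted in this report, so check current pricing before relying on it.

```python
def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Input-token cost of one call, assuming purely linear pricing."""
    return tokens * price_per_million_usd / 1_000_000


# Rate taken from this report (assumption: Sonnet input tokens at $3 per 1M).
SONNET_INPUT_USD_PER_M = 3.00

full_context = input_cost_usd(100_000, SONNET_INPUT_USD_PER_M)  # whole document
retrieved = input_cost_usd(5_000, SONNET_INPUT_USD_PER_M)       # RAG chunks only

print(f"full context:     ${full_context:.3f}/call")
print(f"retrieved chunks: ${retrieved:.3f}/call")
print(f"ratio:            {full_context / retrieved:.0f}x")
```

Output tokens are deliberately left out, since they are the same in both approaches and do not affect the comparison.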
Journey Context:
200K-token context windows create a temptation to stuff everything in. But input token pricing is linear with no volume discount: 100K tokens cost exactly 100 times as much as 1K tokens. The common mistake is never calculating per-task cost. A RAG pipeline adds complexity (embeddings at roughly $0.02 per 1M tokens, vector DB hosting at $20-100/month, retrieval logic) but reduces per-call token count by 10-50x. For Haiku, with lower rates ($0.25/M), full context up to roughly 50K tokens is sometimes viable ($0.0125/call). For Sonnet, even 20K tokens costs $0.06/call. The break-even point shifts with call volume and document update frequency: if documents change hourly, re-embedding costs add up. But for stable documents with high query volume, RAG wins decisively. One exception: tasks requiring synthesis across the entire document (summarize everything, find contradictions) genuinely need full context, and the cost is justified.
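To see how the break-even point moves with document size and infrastructure cost, the trade-off can be sketched as a small calculator. Everything here is an assumption for illustration: the function name, its parameters, and the simplified cost model, which counts only vector DB hosting and re-embedding as RAG overhead. Engineering time, retrieval latency, and answer-quality differences are not modeled, so real thresholds will likely be higher than what this returns.

```python
import math


def break_even_calls_per_month(
    full_tokens: int,
    rag_tokens: int,
    price_per_million_usd: float,
    fixed_monthly_usd: float,
    embedded_tokens_per_month: int = 0,
    embed_price_per_million_usd: float = 0.02,  # rough figure from this report
) -> int:
    """Smallest monthly call count at which RAG's token savings cover its overhead."""
    saving_per_call = (full_tokens - rag_tokens) * price_per_million_usd / 1_000_000
    if saving_per_call <= 0:
        raise ValueError("rag_tokens must be smaller than full_tokens")
    embedding_cost = embedded_tokens_per_month * embed_price_per_million_usd / 1_000_000
    overhead = fixed_monthly_usd + embedding_cost
    return math.ceil(overhead / saving_per_call)


# 100K-token document vs 5K retrieved tokens, $3/M input, $100/month hosting.
large_doc = break_even_calls_per_month(100_000, 5_000, 3.00, 100.0)

# 20K-token document: the per-call saving is smaller, so break-even is higher.
small_doc = break_even_calls_per_month(20_000, 5_000, 3.00, 100.0)

print(f"100K-token doc: break-even at {large_doc} calls/month")
print(f"20K-token doc:  break-even at {small_doc} calls/month")
```

Smaller documents push the threshold up sharply because the per-call saving shrinks while the fixed overhead stays the same, which is why the report treats documents under roughly 10K tokens as poor candidates for RAG on cost grounds alone.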
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:26:39.459845+00:00— report_created — created