Report #46536
[cost_intel] Stuffing entire documents into context windows instead of using RAG with smaller models for retrieval tasks
For tasks where only a fraction of the document is relevant to each query, use RAG to retrieve the 2-5K relevant tokens and process them with a cheaper model. Input cost per call, at the prices assumed here: 100K tokens on Sonnet ($3/M) = $0.30, versus 3K tokens on Haiku 3.5 ($1/M) = $0.003, a 100x difference per query. Use a hybrid: RAG+Haiku for 90% of queries, escalating to full-context Sonnet for the 10% that need whole-document reasoning.
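The arithmetic above can be checked with a short sketch. The prices and token counts are the figures quoted in this report, not verified list prices. The `blended_cost` helper is hypothetical and assumes every escalated query has already paid for its RAG attempt. Output tokens and embedding/retrieval costs are ignored.

```python
# Input prices in USD per million tokens, as quoted in this report (assumptions).
SONNET_PER_M = 3.00
HAIKU_PER_M = 1.00


def input_cost(tokens: int, price_per_m: float) -> float:
    """Input-token cost of a single call, in USD."""
    return tokens / 1_000_000 * price_per_m


# Per-call cost of each strategy, using the report's token counts.
full_context = input_cost(100_000, SONNET_PER_M)  # about $0.30
rag = input_cost(3_000, HAIKU_PER_M)              # about $0.003


def blended_cost(escalation_rate: float) -> float:
    """Average per-query cost of the hybrid (hypothetical helper).

    Every query pays for the RAG attempt; the escalated fraction
    additionally pays for a full-context call.
    """
    return rag + escalation_rate * full_context


if __name__ == "__main__":
    print(f"full context: ${full_context:.3f}/call")
    print(f"RAG + small model: ${rag:.3f}/call")
    print(f"hybrid at 10% escalation: ${blended_cost(0.10):.3f}/query")
```

Under these assumptions the 100x figure applies to queries that stay on the RAG path. With 10% escalation the blended cost is about $0.033 per query, roughly 9x cheaper than sending everything to full context, so the escalation rate dominates the overall saving.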
Journey Context:
The temptation to stuff context is understandable: it is architecturally simpler than RAG, and frontier models now have 200K+ token windows. But the economics at scale are brutal. Processing 10K documents/day at 100K tokens of average context on Sonnet costs $3,000/day in input tokens; retrieving 3K tokens of relevant chunks per query and running them on Haiku costs $30/day. The quality tradeoff is nuanced. RAG + small model matches stuffed context + frontier model when (1) retrieval is accurate (the top-3 chunks contain the relevant information) and (2) the task is extraction or targeted Q&A, not synthesis. RAG fails when the task requires understanding the full document structure (e.g., 'summarize the argument flow across all sections') or synthesizing across many non-adjacent sections. The hybrid approach is the real win: build the system to attempt RAG+Haiku first, detect low-confidence responses, and escalate to the full-context frontier model only when needed. This gives you 100x cost savings on the easy queries and frontier quality on the hard ones.
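A minimal sketch of the attempt-then-escalate flow described above. Everything named here is hypothetical: `keyword_retrieve` is a toy stand-in for a real retriever, the two model functions are stubs in place of real API calls, and the `min_confidence` threshold of 0.7 is an arbitrary placeholder. The report does not say how low-confidence responses are detected, so the confidence score is left as something the caller supplies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A model takes (query, context) and returns (answer, confidence in [0, 1]).
ModelFn = Callable[[str, str], Tuple[str, float]]
RetrieveFn = Callable[[str, str], List[str]]


@dataclass
class Answer:
    text: str
    confidence: float
    route: str  # "rag+small" or "full-context"


def keyword_retrieve(query: str, document: str, top_k: int = 3) -> List[str]:
    """Toy retriever: rank paragraphs by words shared with the query."""
    words = set(query.lower().split())
    scored = []
    for para in document.split("\n\n"):
        score = len(words & set(para.lower().split()))
        if score > 0:
            scored.append((score, para))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [para for _, para in scored[:top_k]]


def answer_query(query: str, document: str, retrieve: RetrieveFn,
                 cheap_model: ModelFn, frontier_model: ModelFn,
                 min_confidence: float = 0.7) -> Answer:
    """Try retrieval plus the cheap model first; escalate if unsure."""
    chunks = retrieve(query, document)
    if chunks:
        text, confidence = cheap_model(query, "\n\n".join(chunks))
        if confidence >= min_confidence:
            return Answer(text, confidence, "rag+small")
    # Nothing retrieved or low confidence: pay for the whole document.
    text, confidence = frontier_model(query, document)
    return Answer(text, confidence, "full-context")


# Stub models for illustration only; real ones would call an LLM API.
def stub_cheap(query: str, context: str) -> Tuple[str, float]:
    return context, 0.9


def stub_frontier(query: str, context: str) -> Tuple[str, float]:
    return "answer from full document", 1.0


if __name__ == "__main__":
    doc = "Pricing is $3 per seat.\n\nSupport hours are 9 to 5."
    for q in ("support hours", "unrelated question"):
        result = answer_query(q, doc, keyword_retrieve, stub_cheap, stub_frontier)
        print(q, "->", result.route)
```

Returning the route alongside the answer makes it easy to log the escalation rate, which is the number that determines how much of the per-query saving survives in practice.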
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:34:58.327387+00:00— report_created — created