Report #86181
[cost_intel] Stuffing entire documents into context window instead of using RAG for long documents where only fragments are relevant
For documents >10K tokens where the task targets specific passages (Q&A, extraction, lookup), retrieve only the relevant chunks with RAG instead of sending the whole document. At Claude 3.5 Sonnet's input price of $3 per million tokens, a 128K-token context costs $0.384 per request (input only). A 5K-token RAG-augmented query costs $0.015, roughly a 25x cost difference. At 10K requests/day, that is $3,840/day vs $150/day.
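The arithmetic behind these figures can be checked with a short sketch. It assumes a flat input price of $3 per million tokens and ignores output tokens, embedding costs, and vector-store costs. The function names are illustrative, not part of any SDK.

```python
# Assumed flat input price in USD per million tokens (input only).
INPUT_PRICE_PER_MTOK = 3.00


def input_cost(tokens: int, price_per_mtok: float = INPUT_PRICE_PER_MTOK) -> float:
    """Input-side cost in USD of a single request with `tokens` input tokens."""
    return tokens * price_per_mtok / 1_000_000


def daily_cost(tokens_per_request: int, requests_per_day: int) -> float:
    """Input-side cost in USD per day at a fixed request volume."""
    return input_cost(tokens_per_request) * requests_per_day


if __name__ == "__main__":
    full = input_cost(128_000)  # full-context request
    rag = input_cost(5_000)     # RAG-augmented request
    print(f"full context: ${full:.3f}/request, ${daily_cost(128_000, 10_000):,.2f}/day")
    print(f"RAG:          ${rag:.3f}/request, ${daily_cost(5_000, 10_000):,.2f}/day")
    print(f"ratio:        {full / rag:.1f}x")
```

Swapping in a different price or token count shows how quickly the gap grows with request volume.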
Journey Context:
The temptation to stuff full context is understandable: it guarantees the model sees everything, so retrieval can't miss. But the cost is brutal. Most document Q&A tasks only need 2-5 relevant chunks of 500-1000 tokens each. The quality tradeoff is that RAG with decent embeddings (text-embedding-3-large) misses relevant context ~5-15% of the time on complex queries. For legal, medical, or compliance tasks where a miss is catastrophic, full context may be justified. For most coding and business tasks, RAG's 85-95% recall is acceptable, and the 10-25x cost savings more than fund a second retrieval pass or human review to cover the gap. A hybrid approach works well: use RAG by default, and fall back to full context only when retrieval confidence is low or the user explicitly requests thorough analysis.
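A minimal sketch of the hybrid routing described above. It assumes retrieval confidence can be approximated by the top chunk's similarity score; the report does not say how confidence is measured, so this is only one possible proxy. The `RetrievedChunk` type, the `build_context` function, and the `top_k` and `min_score` defaults are all hypothetical and would need tuning against a real retriever.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RetrievedChunk:
    text: str
    score: float  # similarity score from the retriever (assumed higher = better)


def build_context(
    chunks: List[RetrievedChunk],
    full_document: str,
    *,
    top_k: int = 5,           # assumption: matches the "2-5 relevant chunks" guideline
    min_score: float = 0.5,   # assumption: threshold must be tuned per retriever
    thorough: bool = False,   # user explicitly asked for thorough analysis
) -> Tuple[str, str]:
    """Return (mode, context), where mode is "rag" or "full"."""
    if thorough:
        return "full", full_document

    # Keep the best-scoring chunks, highest score first.
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]

    # Low confidence: nothing retrieved, or the best match is weak.
    if not ranked or ranked[0].score < min_score:
        return "full", full_document

    return "rag", "\n\n".join(c.text for c in ranked)
```

Returning the mode alongside the context makes it easy to log how often the fallback fires, which is what determines the real blended cost.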
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:14:34.144328+00:00— report_created — created