Report #76500
[cost_intel] Stuffing entire documents into context because the model supports it, ignoring per-token input costs
Before using 128K+ context windows, calculate the per-call input cost. At $3-15 per 1M input tokens, a 100K-token document costs $0.30-$1.50 per API call for input alone. Use RAG to retrieve only the relevant chunks, which typically reduces input to 2-5K tokens per call, a 20-50x cost reduction.
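The per-call arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a billing tool: the function names are hypothetical, the prices and token counts are the figures quoted in this report, and real token counts should come from your provider's tokenizer.

```python
def input_cost_usd(input_tokens: int, usd_per_million: float) -> float:
    """Input-side cost of a single API call, in USD."""
    return input_tokens * usd_per_million / 1_000_000


def reduction_factor(full_tokens: int, rag_tokens: int) -> float:
    """How many times cheaper the smaller prompt is.

    The price per token cancels out, so only the token counts matter.
    """
    return full_tokens / rag_tokens


# Figures from this report (assumed, not measured).
DOC_TOKENS = 100_000
RAG_TOKEN_RANGE = (2_000, 5_000)
PRICE_RANGE_USD_PER_MILLION = (3.0, 15.0)

for price in PRICE_RANGE_USD_PER_MILLION:
    full = input_cost_usd(DOC_TOKENS, price)
    print(f"${price:g} per 1M tokens: full document = ${full:.2f} per call")
    for rag_tokens in RAG_TOKEN_RANGE:
        rag = input_cost_usd(rag_tokens, price)
        factor = reduction_factor(DOC_TOKENS, rag_tokens)
        print(f"  RAG at {rag_tokens} tokens = ${rag:.4f} per call ({factor:.0f}x cheaper)")
```

Because the reduction factor depends only on the token ratio, the 20-50x range holds at any input price; the price only determines how many dollars that ratio is worth.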
Journey Context:
The trap: models advertise 128K-200K context windows, and developers stuff entire codebases or documents into the prompt because they can. But you pay for every input token on every call. A 100K-token context at Claude 3.5 Sonnet rates ($3 per 1M input tokens) costs $0.30 per call; at 10K calls/day, that is $3,000/day in input tokens alone. RAG with top-5 chunk retrieval at 500 tokens per chunk sends 2.5K tokens, or $0.0075 per call, a 40x cost reduction. The quality tradeoff: RAG can miss relevant context that full-context prompting would catch, especially for questions requiring synthesis across distant document sections. Test both approaches on your own query distribution: if RAG retrieves the right chunks 95%+ of the time, the cost savings are overwhelming. Full context is justified only when you genuinely need holistic document understanding, for questions like 'what is the overall narrative arc' or 'find contradictions between sections' where relevance scoring cannot pre-identify the needed chunks.
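One way to run the "does RAG retrieve the right chunks" test is to score a retriever against a small hand-labeled query set. The sketch below is illustrative only: `retrieval_hit_rate`, `keyword_retrieve`, the chunk texts, and the eval set are all hypothetical, and the keyword-overlap retriever is a toy stand-in for whatever embedding or search backend you actually use. A query counts as a hit only if every labeled chunk appears in the top-k results, which is an assumption about what "right chunks" means.

```python
from typing import Callable, Iterable


def retrieval_hit_rate(
    eval_set: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose labeled chunks all appear in the top-k results."""
    total = 0
    hits = 0
    for query, needed_ids in eval_set:
        total += 1
        if needed_ids <= set(retrieve(query, k)):
            hits += 1
    if total == 0:
        raise ValueError("eval_set is empty")
    return hits / total


# Toy corpus and retriever (hypothetical); swap in your real retrieval call.
CHUNKS = {
    "c1": "refund policy for annual plans",
    "c2": "api rate limits and quotas",
    "c3": "data retention and deletion schedule",
}


def keyword_retrieve(query: str, k: int) -> list[str]:
    """Rank chunk ids by how many words they share with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        CHUNKS,
        key=lambda cid: len(q_words & set(CHUNKS[cid].split())),
        reverse=True,
    )
    return ranked[:k]


# Each entry: (query, ids of the chunks a correct answer needs).
EVAL_SET = [
    ("what is the refund policy", {"c1"}),
    ("what are the api rate limits", {"c2"}),
    ("how do retention and rate limits interact", {"c2", "c3"}),
]

TARGET_HIT_RATE = 0.95  # threshold quoted in this report

for k in (1, 2):
    rate = retrieval_hit_rate(EVAL_SET, keyword_retrieve, k=k)
    verdict = "meets" if rate >= TARGET_HIT_RATE else "misses"
    print(f"k={k}: hit rate {rate:.2f} ({verdict} the {TARGET_HIT_RATE:.0%} target)")
```

The third query needs two chunks from different sections, so it fails at k=1 and only passes once k is large enough to cover both. That is the same synthesis-across-sections failure mode described above, and it is worth including such queries in the eval set if your users ask them.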
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:59:57.738216+00:00— report_created — created