Report #59628
[cost_intel] Full-document context injection when RAG would reduce input tokens 10-50x with equivalent quality
For documents over 5k tokens where only a subset is relevant to each query, use RAG to inject the 2-5k relevant tokens instead of the full document. At Sonnet pricing ($3/M input tokens), a 100k-token document costs $0.30 per request versus $0.009 for 3k retrieved tokens, a roughly 33x difference. At 10k requests/day, that is $3,000/day versus $90/day. RAG is a cost technique as much as a quality technique.
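The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not a billing tool: the function names `input_cost` and `daily_cost` are illustrative, the $3/M price is the figure quoted in this report, and the calculation covers input tokens only (it ignores output tokens and whatever the retrieval system itself costs).

```python
# Assumption: $3 per million input tokens, the Sonnet price quoted in this report.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost in USD for one request."""
    return tokens / 1_000_000 * price_per_million


def daily_cost(tokens_per_request: int, requests_per_day: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost in USD for one day of traffic."""
    return input_cost(tokens_per_request, price_per_million) * requests_per_day


if __name__ == "__main__":
    full = input_cost(100_000)  # full 100k-token document on every request
    rag = input_cost(3_000)     # 3k retrieved tokens on every request
    print(f"full context: ${full:.3f}/request, ${daily_cost(100_000, 10_000):,.0f}/day")
    print(f"retrieved:    ${rag:.3f}/request, ${daily_cost(3_000, 10_000):,.0f}/day")
    print(f"ratio: {full / rag:.1f}x")
```

Swapping in a different price or token count shows how quickly the gap scales with request volume.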
Journey Context:
RAG is almost exclusively discussed as a quality/relevance technique, but the cost argument is independently compelling and often the larger win. The counter-argument, that RAG adds retrieval-system cost and that retrieval failures degrade quality, is valid but manageable. Practical thresholds: under 5k tokens, include the full document (the cost is negligible and it avoids retrieval failures). Over 20k tokens, RAG is almost always worth it on cost alone. The 5k-20k range is a judgment zone that depends on query selectivity and request volume. Full context also has a hidden cost: long contexts increase response latency and can degrade instruction-following as the model attends to irrelevant content, so full context can cost quality as well as money.
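The thresholds above could be encoded as a simple routing rule. The sketch below is one possible reading, not a definitive policy: `Strategy`, `choose_strategy`, and `net_daily_savings` are hypothetical names, the 5k and 20k cutoffs come from this report, and `retrieval_cost_per_day` is a placeholder for whatever the retrieval system actually costs to run.

```python
from enum import Enum


class Strategy(Enum):
    FULL_CONTEXT = "full_context"
    JUDGMENT = "judgment"
    RAG = "rag"


# Cutoffs taken from the report's practical thresholds.
FULL_CONTEXT_BELOW_TOKENS = 5_000
RAG_ABOVE_TOKENS = 20_000


def choose_strategy(document_tokens: int) -> Strategy:
    """Map a document's token count to the report's three zones."""
    if document_tokens < FULL_CONTEXT_BELOW_TOKENS:
        return Strategy.FULL_CONTEXT  # cost is negligible, no retrieval failures
    if document_tokens > RAG_ABOVE_TOKENS:
        return Strategy.RAG           # worth it on cost alone
    return Strategy.JUDGMENT          # depends on selectivity and volume


def net_daily_savings(document_tokens: int, retrieved_tokens: int,
                      requests_per_day: int,
                      price_per_million: float = 3.00,
                      retrieval_cost_per_day: float = 0.0) -> float:
    """Hypothetical helper for the judgment zone: input-token savings in USD
    per day from RAG, minus an assumed daily cost of running retrieval."""
    saved_tokens = document_tokens - retrieved_tokens
    return saved_tokens / 1_000_000 * price_per_million * requests_per_day - retrieval_cost_per_day


if __name__ == "__main__":
    for size in (3_000, 10_000, 100_000):
        print(size, choose_strategy(size).value)
    # A 10k-token document at 10k requests/day, retrieving 3k tokens per query.
    print(net_daily_savings(10_000, 3_000, 10_000))
```

In the judgment zone, a positive `net_daily_savings` only covers the cost side; retrieval quality for the specific query mix still has to be checked separately.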
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:34:30.753601+00:00 - report_created - created