Report #67958
[cost_intel] Input token costs from oversized context windows in RAG and conversations
Trim RAG retrieval from top-10 to top-3 chunks: this typically costs under 5% in quality while cutting retrieved-context input tokens by about 3.3x. Summarize older conversation turns instead of re-sending the full history. Every input token is billed at full rate on every call, whether or not the model uses it. Before trimming, measure retrieval quality with recall-at-k so the quality tradeoff is quantified rather than assumed.
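A minimal sketch of the recall-at-k check mentioned above. The function names, chunk ids, and the two-query evaluation set are hypothetical placeholders; a real evaluation would use labeled queries from your own corpus.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(eval_set, k):
    """Average recall-at-k over (ranked_retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in eval_set) / len(eval_set)


# Hypothetical labeled queries: (ranked retrieved chunk ids, ids judged relevant).
eval_set = [
    (["c1", "c7", "c3", "c9", "c2"], {"c1", "c3"}),
    (["c4", "c5", "c8", "c6", "c0"], {"c5", "c6"}),
]

for k in (3, 5, 10):
    print(f"recall@{k} = {mean_recall_at_k(eval_set, k):.2f}")
```

If recall at 3 sits close to recall at 10 on your own data, trimming is likely cheap in quality terms; a large gap suggests keeping more chunks.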
Journey Context:
With 128K-200K token context windows, it is tempting to stuff all available context into every call. But input tokens are billed at full rate whether or not they contribute to output quality. In RAG pipelines, retrieval relevance often drops sharply after the top 3 results: chunks 4 through 10 add little new information yet more than triple the retrieved-context tokens. A pipeline retrieving 10 chunks of 1,000 tokens each uses 10K input tokens per query; top-3 uses 3K. At $3/M input tokens for Sonnet-class models, that is $0.03 vs $0.009 per query, a roughly 3.3x difference that compounds to thousands of dollars monthly for high-volume pipelines. For multi-turn conversations, re-sending the full history makes per-call input grow linearly with the number of turns, so the total input billed across the conversation grows roughly quadratically: a 20-turn conversation might accumulate 15K tokens of history that is re-sent on every subsequent turn. Summarizing the older turns (for example, turns 1-15) into 500 tokens saves up to 14.5K input tokens per call, less the tokens of whatever recent turns are kept verbatim. The quality tradeoff is minimal for factual Q&A, where recent context dominates, but significant for tasks that require precise recall of earlier details, such as negotiation or collaborative editing. Always benchmark with your actual retrieval system: some embedding models have flatter relevance curves, where top-5 is genuinely better than top-3.
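The arithmetic above can be reproduced with a short sketch. The $3/M price and the 1,000-token chunks come from the report; the monthly query volume, the uniform 750 tokens per turn (15K spread over 20 turns), and the choice to keep the last 5 turns verbatim are illustrative assumptions, and the helper names are hypothetical.

```python
PRICE_PER_MTOK = 3.00        # USD per million input tokens (figure from the report)
QUERIES_PER_MONTH = 100_000  # assumed volume, for illustration only


def input_cost(tokens, price_per_mtok=PRICE_PER_MTOK):
    """USD cost of sending `tokens` input tokens."""
    return tokens * price_per_mtok / 1_000_000


def rag_query_cost(k, chunk_tokens=1000):
    """Input cost of the retrieved context for one query with top-k chunks."""
    return input_cost(k * chunk_tokens)


def history_tokens_sent(turns, tokens_per_turn, keep_recent=None, summary_tokens=0):
    """Total history tokens sent over a whole conversation.

    The call for turn t carries the history of turns 1..t-1. If keep_recent
    is set, turns older than the last `keep_recent` are replaced by a
    fixed-size summary of `summary_tokens` tokens.
    """
    total = 0
    for t in range(1, turns + 1):
        prior = t - 1
        if keep_recent is None or prior <= keep_recent:
            total += prior * tokens_per_turn
        else:
            total += keep_recent * tokens_per_turn + summary_tokens
    return total


top10, top3 = rag_query_cost(10), rag_query_cost(3)
print(f"per query: ${top10:.3f} vs ${top3:.3f} ({top10 / top3:.1f}x)")
print(f"monthly saving: ${(top10 - top3) * QUERIES_PER_MONTH:,.0f}")

full = history_tokens_sent(20, 750)
trimmed = history_tokens_sent(20, 750, keep_recent=5, summary_tokens=500)
print(f"history tokens over 20 turns: {full:,} full vs {trimmed:,} summarized")
```

Under these assumptions the summarized conversation sends roughly half the history tokens of the full one. The gap widens as conversations get longer, because the full-history total grows with the square of the turn count.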
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:32:58.034012+00:00— report_created — created