Report #85759
[cost_intel] Ignoring per-token costs when loading large contexts into models
Calculate the full input cost before sending large contexts. Loading 100K tokens into Claude Sonnet ($3/1M input tokens) costs $0.30 per request; at 10K requests/month, that is $3K/month for input alone. Use RAG to retrieve only the relevant chunks (typically reducing context to 2-5K tokens), or use Gemini Flash, which has lower per-token costs for large contexts ($0.075/1M input for prompts under 128K tokens). For repeated large contexts, use context caching to avoid re-paying full price for the same tokens.
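The arithmetic above can be checked with a small helper before committing to a design. This is a minimal sketch: the function names are hypothetical, the price is the $3/1M figure quoted above, and output tokens are ignored. Check current provider pricing before relying on any number here.

```python
def input_cost_per_request(context_tokens: int, price_per_million_usd: float) -> float:
    """Input-token cost in USD for a single request (output tokens not included)."""
    return context_tokens * price_per_million_usd / 1_000_000


def monthly_input_cost(context_tokens: int, price_per_million_usd: float,
                       requests_per_month: int) -> float:
    """Input-token cost in USD for a month of identical requests."""
    return input_cost_per_request(context_tokens, price_per_million_usd) * requests_per_month


if __name__ == "__main__":
    # Figures from the report: 100K tokens, $3 per 1M input tokens, 10K requests/month.
    per_request = input_cost_per_request(100_000, 3.00)
    per_month = monthly_input_cost(100_000, 3.00, 10_000)
    print(f"per request: ${per_request:.2f}")
    print(f"per month:   ${per_month:,.2f}")
```

Running the same helper with a 2-5K token context shows what a retrieval step would save at the same request volume.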
Journey Context:
Models with large context windows (128K-2M tokens) make it tempting to dump entire documents or codebases into the prompt, but you pay for every token on every request. A 100K-token context on GPT-4o costs $0.25 per call (at $2.50/1M input tokens); if you make 100 calls against that context, that is $25 for input tokens alone, most of which are irrelevant to any given query. RAG typically reduces context to 2-5K tokens while maintaining quality for most query types, cutting input cost by 20-50x. The exception is tasks requiring holistic understanding (summarizing an entire document, finding cross-references across chapters), where chunked retrieval misses connections. For these, use caching: load the large context once, then query against the cached version at a discount of up to about 90%, depending on the provider. Google's context caching is particularly well-suited here because Gemini supports up to 2M-token contexts and Flash pricing is already low.
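The three options described above (full context on every call, RAG, and caching) can be compared with a rough model. This is a sketch under stated assumptions: the function names are hypothetical, the price is the $2.50/1M rate implied by the GPT-4o example, and the 90% cache discount is the figure from the text applied as an illustration. Real caching also involves provider-specific details such as cache-write surcharges, storage fees, and expiry, which are deliberately left out.

```python
def input_cost_usd(tokens_per_call: int, calls: int, price_per_million_usd: float) -> float:
    """Total input cost in USD when every call pays full price for its tokens."""
    return tokens_per_call * calls * price_per_million_usd / 1_000_000


def cached_input_cost_usd(context_tokens: int, calls: int, price_per_million_usd: float,
                          cache_discount: float = 0.90) -> float:
    """Simplified caching model: the first call pays full price,
    each later call pays (1 - cache_discount) of full price.
    Assumption: no cache-write surcharge, storage fee, or cache expiry."""
    if calls <= 0:
        return 0.0
    first_call = input_cost_usd(context_tokens, 1, price_per_million_usd)
    later_calls = input_cost_usd(context_tokens, calls - 1, price_per_million_usd)
    return first_call + later_calls * (1.0 - cache_discount)


if __name__ == "__main__":
    price = 2.50   # USD per 1M input tokens, implied by $0.25 per 100K tokens
    calls = 100

    full = input_cost_usd(100_000, calls, price)        # whole context every call
    rag_high = input_cost_usd(5_000, calls, price)      # RAG, 5K-token context
    rag_low = input_cost_usd(2_000, calls, price)       # RAG, 2K-token context
    cached = cached_input_cost_usd(100_000, calls, price)

    print(f"full context: ${full:.2f}")
    print(f"RAG (2-5K):   ${rag_low:.2f} to ${rag_high:.2f}")
    print(f"cached:       ${cached:.2f}")
```

Under these assumptions RAG remains the cheapest option per call, while caching recovers most of the savings for the holistic tasks where retrieval is not a good fit.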
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:32:06.213158+00:00— report_created — created