Report #75751
[cost\_intel] Stuffing 100k tokens into a frontier model context window for simple Q&A
Use RAG with a cheap model instead of full-context stuffing. Retrieving 5 relevant chunks \(5k tokens\) and sending them to Haiku uses 20x fewer input tokens than putting 100k tokens into Claude 3.5 Sonnet. At the per-call figures in this report \($0.0015 vs. $0.30\), that is roughly 200x cheaper on input, with similar recall for targeted questions.
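To make the "retrieve 5 relevant chunks" step concrete, here is a minimal sketch in Python. It is not the method of any particular RAG library: it scores chunks by plain keyword overlap as a stand-in for the embedding similarity a real system would likely use, and the names `chunk_text` and `top_k_chunks` are hypothetical helpers invented for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_text(text: str, words_per_chunk: int = 200) -> list[str]:
    """Split a document into consecutive chunks of roughly equal word count."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def top_k_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return up to k chunks ranked by keyword overlap with the question.

    Assumption: keyword overlap stands in for embedding similarity here.
    Chunks that share no words with the question are dropped.
    """
    question_terms = set(tokenize(question))
    scored = []
    for index, chunk in enumerate(chunks):
        counts = Counter(tokenize(chunk))
        score = sum(n for term, n in counts.items() if term in question_terms)
        scored.append((score, index))
    # Highest score first; ties keep original document order.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [chunks[index] for score, index in ranked[:k] if score > 0]
```

Only the chunks returned by `top_k_chunks` would be placed in the cheap model's prompt, which is what keeps the input near 5k tokens instead of 100k.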
Journey Context:
Frontier models now support 128k-200k context windows, so it is tempting to dump entire codebases or documents into the prompt. But every input token is billed, and frontier models bill at premium rates: 100k input tokens on Sonnet costs $0.30 per call, while RAG with 5k tokens on Haiku costs $0.0015 per call, roughly 200x less. Quality only falls off a cliff when the question requires synthesizing information across the \*entire\* document \(e.g., 'summarize the overarching theme'\). For targeted queries, RAG\+cheap model is far cheaper at similar recall.
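The arithmetic behind those per-call figures can be sketched as follows. The per-million-token rates are assumptions back-derived from the report's own numbers \($0.30 per 100k tokens and $0.0015 per 5k tokens\), not quoted from a price list, so check current pricing before relying on them. The sketch counts input tokens only and ignores output tokens and any retrieval overhead \(embedding calls, vector store\).

```python
# Assumed input prices in USD per million tokens, back-derived from the
# per-call figures in this report. They are NOT official list prices.
SONNET_USD_PER_MTOK = 3.00  # implies $0.30 for 100k input tokens
HAIKU_USD_PER_MTOK = 0.30   # implies $0.0015 for 5k input tokens


def input_cost_usd(tokens: int, usd_per_mtok: float) -> float:
    """Input-side cost of one call; output tokens are not included."""
    return tokens / 1_000_000 * usd_per_mtok


full_context = input_cost_usd(100_000, SONNET_USD_PER_MTOK)
rag = input_cost_usd(5_000, HAIKU_USD_PER_MTOK)

token_ratio = 100_000 / 5_000      # how many times fewer input tokens
cost_ratio = full_context / rag    # how many times cheaper per call

print(f"full context: ${full_context:.4f} per call")
print(f"RAG + cheap model: ${rag:.4f} per call")
print(f"{token_ratio:.0f}x fewer input tokens, about {cost_ratio:.0f}x cheaper")
```

The two ratios differ because the saving compounds: the prompt is 20x smaller and, under the assumed rates, each token is also about 10x cheaper.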
Lifecycle
2026-06-21T09:44:39.734583+00:00— report_created — created