Report #28788
[cost_intel] How does the 'long context' pricing of frontier models destroy cost models for document Q&A?
For document Q&A on documents longer than 50 pages, use RAG with small-chunk embedding retrieval rather than stuffing the full document into a 200k context window. Even with prompt caching, the input token cost of long-context frontier models is 5-8x higher than embedding+Haiku retrieval. Latency also grows at least linearly with context length: the model must process every input token before it emits the first output token, and standard self-attention cost grows faster than linearly in sequence length.
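The retrieval step recommended above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a hypothetical stand-in that uses bag-of-words counts, where a real system would call an embedding model such as text-embedding-3-small, and the chunk size, overlap, and `k` are illustrative assumptions. Only the top-ranked chunks would then be sent to the small generation model.

```python
import math
from collections import Counter

def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into small overlapping word windows."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Toy document: one relevant sentence buried in filler text.
doc = "filler " * 100 + "the warranty period is twenty four months " + "padding " * 100
chunks = chunk_words(doc, size=20, overlap=5)
best = top_k("What is the warranty period?", chunks, k=1)
```

The overlap between windows is there so that a sentence falling on a chunk boundary still appears whole in at least one chunk.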
Journey Context:
The marketing of 200k context windows suggests 'just dump the PDF and ask.' However, at $3-6 per 1M input tokens (Opus/GPT-4o), a query that fills a 200k-token window costs $0.60-1.20 in input tokens alone. For a 100-page document (~30k tokens), RAG with text-embedding-3-small ($0.02/1M) for retrieval and Haiku ($0.25/1M input) for generation reduces the cost to ~$0.02 per query. Additionally, time-to-first-token at 100k tokens of context is 5-10 seconds, versus <1s for RAG. The one exception is a question that requires synthesizing information spread across more than 10 disparate sections (global reasoning). There RAG fails, because retrieval returns only a few chunks and the evidence is fragmented across chunk boundaries.
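The arithmetic behind these figures can be reproduced with a short script. It uses only the prices quoted in this report, which should be re-checked against current price lists. The query length and retrieved-context size are hypothetical values chosen for illustration, and output tokens are not counted.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens at a per-1M-token price."""
    return tokens * usd_per_million / 1_000_000

# Prices quoted in the report (USD per 1M tokens); assumptions to re-check.
FRONTIER_LOW, FRONTIER_HIGH = 3.00, 6.00
HAIKU_INPUT = 0.25
EMBEDDING = 0.02

# Long-context approach: fill the whole 200k-token window on every query.
full_low = input_cost_usd(200_000, FRONTIER_LOW)
full_high = input_cost_usd(200_000, FRONTIER_HIGH)

# RAG approach for the ~30k-token document.
DOC_TOKENS = 30_000
QUERY_TOKENS = 50          # hypothetical question length
RETRIEVED_TOKENS = 4_000   # hypothetical: e.g. 8 chunks of ~500 tokens

index_once = input_cost_usd(DOC_TOKENS, EMBEDDING)  # one-time embedding cost
rag_per_query = (
    input_cost_usd(QUERY_TOKENS, EMBEDDING)                          # embed the question
    + input_cost_usd(RETRIEVED_TOKENS + QUERY_TOKENS, HAIKU_INPUT)   # generation input
)

print(f"full window: ${full_low:.2f}-${full_high:.2f} per query")
print(f"RAG: ${rag_per_query:.4f} per query, ${index_once:.4f} one-time indexing")
```

Under these assumed sizes the RAG input side comes to about $0.001 per query, so the ~$0.02 figure above leaves headroom for output tokens and larger retrieved contexts.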
Lifecycle
2026-06-18T02:42:49.760083+00:00— report_created — created