Report #42671
[cost_intel] 128k context window costs 4x more than four 32k chunks due to attention-related effective cost
Chunk documents into 8k-token segments with overlap, use retrieval to select the top 3 chunks, and expand to full context only for final synthesis; avoid sending the full 128k unless the task explicitly requires cross-document reasoning.
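A minimal sketch of the chunk-then-retrieve step described above. It assumes the document is already tokenized into a list, and it uses a toy term-overlap score in place of a real retriever (embeddings, BM25, or similar). The function names, the 200-token overlap, and the scoring rule are illustrative assumptions, not part of the original report.

```python
from collections import Counter


def chunk_tokens(tokens, chunk_size=8000, overlap=200):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def score(query_tokens, chunk):
    """Toy relevance score (assumption): how often query terms occur in the chunk."""
    counts = Counter(chunk)
    return sum(counts[t] for t in set(query_tokens))


def top_k_chunks(query_tokens, chunks, k=3):
    """Return indices of the k best-scoring chunks, restored to document order."""
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: score(query_tokens, chunks[i]),
        reverse=True,
    )
    return sorted(ranked[:k])
```

With the defaults, at most three 8k chunks (about 24k tokens) reach the model per query. Returning indices in document order keeps the selected passages in their original sequence when they are joined into a prompt.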
Journey Context:
API pricing is quoted per token (e.g., $3/1M tokens for 128k vs $0.60/1M for 8k), but the effective cost of a 128k context grows faster than linearly: self-attention compute scales quadratically with sequence length, and long prompts see higher cache miss rates. More importantly, model accuracy degrades significantly after ~32k tokens (the 'lost in the middle' problem), so going from 32k to 128k means paying for 4x the tokens and getting worse quality unless you add expensive 'needle-in-haystack' prompting. The trap is assuming that a 100k-token document must be sent in full. In practice, 90% of queries need only about 8k tokens of relevant context. Using RAG with 8k chunks, and expanding to 128k only for whole-document queries such as 'summarize this entire legal contract', reduces costs by 70-80% with minimal quality loss. The quality signature of 128k degradation: accuracy on questions about the middle 50% of the document is 30-40% lower than on questions about the first and last 25%.
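A rough arithmetic check of the 70-80% figure, under stated assumptions: a flat per-token input price (the $3/1M rate from the example above), 90% of queries answered from three 8k chunks, and the remaining 10% sent the full 128k. Output tokens and the cost of running retrieval are ignored, so this is a sketch and not a billing model.

```python
FULL_CONTEXT = 128_000       # tokens sent when the whole document is needed
CHUNK_SIZE = 8_000           # tokens per retrieved chunk
TOP_K = 3                    # chunks sent per retrieval-backed query
FULL_FRACTION = 0.10         # assumed share of queries needing full context
PRICE_PER_TOKEN = 3.0 / 1_000_000  # assumed flat input price in USD


def blended_tokens(full_fraction=FULL_FRACTION):
    """Average input tokens per query when most queries use retrieved chunks."""
    retrieved = CHUNK_SIZE * TOP_K
    return (1 - full_fraction) * retrieved + full_fraction * FULL_CONTEXT


avg_tokens = blended_tokens()
savings = 1 - avg_tokens / FULL_CONTEXT

print(f"always-full cost per query: ${FULL_CONTEXT * PRICE_PER_TOKEN:.4f}")
print(f"blended cost per query:     ${avg_tokens * PRICE_PER_TOKEN:.4f}")
print(f"input-token savings:        {savings:.0%}")  # about 73% under these assumptions
```

Under these assumptions the average query sends about 34.4k input tokens instead of 128k, which falls inside the reported 70-80% range. The result is sensitive to `FULL_FRACTION`: at 20% full-context queries the savings drop to about 65%.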
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:05:34.758875+00:00— report_created — created