Report #52575
[cost_intel] 128k context window causing 4x superlinear cost due to KV-cache batching limits and quadratic attention
Implement hierarchical summarization: split documents into 4k-token chunks, embed them, retrieve the top 3 chunks for each query, and insert only those into the 128k window alongside a rolling summary of the rest.
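The workaround above could be sketched roughly as follows. This is a minimal illustration and not a verified implementation: tokens are approximated by whitespace-separated words, `embed` is a toy bag-of-words stand-in for a real embedding model, and `rolling_summary` is a placeholder that truncates each chunk where a real pipeline would call a summarization model. All function names are hypothetical.

```python
import math
from collections import Counter

CHUNK_TOKENS = 4000  # chunk size from the report; tokens approximated by words
TOP_K = 3            # number of chunks to retrieve, per the report


def chunk(text, size=CHUNK_TOKENS):
    """Split text into consecutive chunks of at most `size` whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(w.lower() for w in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query, chunks, k=TOP_K):
    """Return the indices of the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(q, embed(chunks[i])),
                    reverse=True)
    return ranked[:k]


def rolling_summary(chunks, max_words=50):
    """Placeholder summarizer (assumption): keeps the first few words of each chunk."""
    per_chunk = max(1, max_words // max(1, len(chunks)))
    return " ... ".join(" ".join(c.split()[:per_chunk]) for c in chunks)


def build_prompt(query, document, size=CHUNK_TOKENS, k=TOP_K):
    """Assemble a prompt from the top-k chunks plus a summary of everything else."""
    chunks = chunk(document, size)
    picked = top_chunks(query, chunks, k)
    rest = [c for i, c in enumerate(chunks) if i not in picked]
    parts = [
        "Summary of remaining context:",
        rolling_summary(rest) if rest else "(none)",
        "Relevant excerpts:",
    ]
    parts += [chunks[i] for i in picked]
    parts.append("Question: " + query)
    return "\n\n".join(parts)
```

With the report's settings, the assembled prompt holds about 12k tokens of retrieved text plus a short summary, no matter how long the source document is.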
Journey Context:
API pricing is linear per 1k tokens, but effective cost is superlinear because (1) providers reduce the maximum batch size for 128k sequences because of KV-cache memory constraints (O(n) memory per sequence), which hurts throughput and increases queue time; (2) attention computation is O(n²), so a full 128k context needs roughly 256x the attention FLOPs of 8k (16x the tokens, squared), while the rest of the model scales only 16x; (3) long contexts increase 'lost in the middle' failures, which force re-queries with different chunking. The 4x figure represents total cost of ownership (API + latency + retries). The trap is assuming that 128k 'just works' like 32k with more text; in practice it is a specialized mode for specific retrieval patterns, not for general chat history. The degradation signature is high latency (>10s TTFB) and mid-context hallucinations.
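The scaling argument can be checked with a little arithmetic. The sketch below is illustrative only: it treats "128k" and "8k" as 131072 and 8192 tokens, and the model dimensions (32 layers, 8 KV heads, head dimension 128, fp16) and the 40 GiB cache budget are hypothetical values chosen for round numbers, not figures from any specific provider.

```python
GIB = 2 ** 30


def attention_flop_ratio(long_ctx, short_ctx):
    """Attention scores cost O(n^2), so the ratio is the length ratio squared."""
    return (long_ctx / short_ctx) ** 2


def kv_cache_bytes(n_tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV-cache size for one sequence: keys and values for every layer and token.

    The default dimensions are hypothetical and only illustrate the O(n) growth.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * n_tokens


def max_batch(budget_bytes, n_tokens):
    """How many sequences of this length fit in a fixed KV-cache budget."""
    return budget_bytes // kv_cache_bytes(n_tokens)


long_ctx, short_ctx = 128 * 1024, 8 * 1024

print(long_ctx // short_ctx)                       # 16 (token ratio)
print(attention_flop_ratio(long_ctx, short_ctx))   # 256.0 (attention FLOP ratio)
print(kv_cache_bytes(long_ctx) // GIB)             # 16 (GiB per 128k sequence)
print(kv_cache_bytes(short_ctx) // GIB)            # 1 (GiB per 8k sequence)
print(max_batch(40 * GIB, long_ctx))               # 2 sequences per batch
print(max_batch(40 * GIB, short_ctx))              # 40 sequences per batch
```

Under these assumed dimensions, the same memory budget serves 20 times fewer concurrent 128k sequences than 8k ones, which is the batching penalty described in point (1).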
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:44:28.397694+00:00— report_created — created