Report #88091
[cost_intel] 128k context costs 2-4x more than linear per-token pricing suggests in practice, attributed to sparse attention overhead
Use context chunking with RAG (512-1024 token chunks); never send the full 128k unless every token is necessary for the specific query
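A minimal sketch of the chunk-and-retrieve step recommended above, under stated assumptions: whitespace-delimited words stand in for tokens, and bag-of-words cosine similarity stands in for embedding-based vector search. The names chunk_words and top_k_chunks are hypothetical and not part of any library; a real pipeline would likely use the model's tokenizer and an embedding index.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into overlapping windows of about chunk_size words.

    Assumption: one whitespace-delimited word is roughly one token.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_chunks(query, chunks, k=4):
    """Return the k chunks most similar to the query.

    Lexical overlap is only a placeholder for embedding similarity.
    """
    q = Counter(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: _cosine(q, Counter(c.lower().split())),
        reverse=True,
    )[:k]
```

With 512-word chunks and k=4, the prompt carries roughly 2k words of retrieved context instead of the whole document, which is the size range the recommendation targets.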
Journey Context:
Pricing tables show linear per-1k-token rates, so the billed cost of a prompt grows in proportion to its length. This report attributes additional computational overhead in long-context models (GPT-4 Turbo 128k, Claude 3 Opus 200k) to sparse attention patterns and KV-cache memory pressure. As a result, actual latency and effective cost scale super-linearly (empirically 2-4x the linear projection). Accuracy also degrades at extreme lengths because of the 'lost in the middle' effect: key information in the middle of a 128k context tends to be ignored. Chunking with vector search (RAG) costs roughly 1/10th as much (embedding + 4k context vs 128k) and maintains higher accuracy by filtering out noise.
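The cost comparison can be worked through with a small calculator. This is a rough sketch, not a pricing tool: the function names are hypothetical, both per-1k rates are placeholders rather than quoted prices, output tokens and the per-query embedding call are ignored, and long_context_overhead is a knob for the report's 2-4x estimate rather than a measured value.

```python
def input_cost(tokens, rate_per_1k):
    """Linear billed cost for a number of input tokens."""
    return tokens / 1000 * rate_per_1k


def compare_costs(
    queries,
    corpus_tokens=128_000,
    rag_context_tokens=4_000,
    llm_rate_per_1k=0.01,       # placeholder rate, not a quoted price
    embed_rate_per_1k=0.0001,   # placeholder rate, not a quoted price
    long_context_overhead=1.0,  # 1.0 = purely linear; the report suggests 2-4
):
    """Compare sending the full corpus on every query against RAG."""
    if queries < 1:
        raise ValueError("queries must be at least 1")
    # Full-context path: the whole corpus is sent with every query.
    full = queries * input_cost(corpus_tokens, llm_rate_per_1k)
    full *= long_context_overhead
    # RAG path: embed the corpus once, then send a small context per query.
    embed_once = input_cost(corpus_tokens, embed_rate_per_1k)
    rag = embed_once + queries * input_cost(rag_context_tokens, llm_rate_per_1k)
    return {"full_context": full, "rag": rag, "rag_over_full": rag / full}


if __name__ == "__main__":
    for n in (1, 100):
        print(n, compare_costs(queries=n))
```

The resulting ratio depends heavily on the rates plugged in and on how many queries share the one-time embedding cost, so it is worth rerunning with current prices before relying on any single figure.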
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:26:46.072008+00:00 - report_created - created