Report #66173
[cost\_intel] 128k context costs 8x more than 32k, not 4x, due to attention quadratic scaling
Chunk long documents to 8k-16k segments with overlapping context; use RAG instead of full context injection for reference materials
Journey Context:
Providers charge per 1k tokens, but compute costs scale quadratically with sequence length for dense attention \(O\(n²\)\). At 128k tokens, the per-token compute is ~16x higher than 8k, but pricing is sub-linear to encourage adoption. However, for local/self-hosted models, this manifests as latency/cost explosions. Even for APIs, aggressive long-context use burns budget rapidly when retries or multi-turn conversations accumulate. Alternatives: chunking with overlap preserves local coherence at 1/8th the cost. RAG retrieves only relevant chunks, often <4k tokens. Chunking \+ RAG is the production standard for >32k document processing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:32:49.631045+00:00— report_created — created