Report #61074
[cost\_intel] Extending the context window from 8k to 128k raises effective cost per token by 3-5x because attention compute scales quadratically with sequence length, not linearly as per-token pricing suggests
Implement hierarchical context compression: once conversation history exceeds 4k tokens, use a smaller model \(e.g., GPT-3.5\) to summarize the older portion into 'memory embeddings' or structured notes, and keep only the last 4k tokens in the active window. For RAG, truncate retrieved chunks aggressively \(top-3 chunks, max 512 tokens each\) rather than filling the context up to 128k.
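A minimal Python sketch of the workaround above, under stated assumptions: `count_tokens` is a crude whitespace proxy rather than a real tokenizer, `summarize` is a hypothetical callable standing in for the call to the smaller model, and `compress_history` and `truncate_chunks` are illustrative names, not part of any library.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # A real tokenizer will give different counts.
    return len(text.split())


def compress_history(
    messages: List[str],
    summarize: Callable[[str], str],
    active_budget: int = 4096,
) -> Tuple[str, List[str]]:
    """Split history into summarized notes and a recent active window.

    Walks backwards from the newest message, keeping messages until the
    token budget is used up. Everything older is passed to `summarize`
    (hypothetical: the smaller model) and returned as notes.
    """
    used = 0
    split = len(messages)  # index of the oldest message kept verbatim
    for i in range(len(messages) - 1, -1, -1):
        cost = count_tokens(messages[i])
        if used + cost > active_budget:
            break
        used += cost
        split = i
    older, recent = messages[:split], messages[split:]
    # If even the newest message exceeds the budget, `recent` is empty
    # and the whole history is summarized.
    notes = summarize("\n".join(older)) if older else ""
    return notes, recent


def truncate_chunks(
    ranked_chunks: List[str], top_k: int = 3, max_tokens: int = 512
) -> List[str]:
    """Keep the top_k highest-ranked chunks, each cut to max_tokens."""
    return [" ".join(c.split()[:max_tokens]) for c in ranked_chunks[:top_k]]
```

The prompt would then be assembled from the notes, the truncated chunks, and the recent messages. Whether summaries preserve enough detail depends on the task, so the budget and chunk limits are worth tuning against your own quality checks.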
Journey Context:
Transformer attention has O\(n²\) compute complexity in sequence length. Providers charge per token, which looks linear, but the infrastructure cost does not: going from 8k to 128k is a 16x increase in context, so the attention term alone grows by 16² = 256x. Layers outside attention scale linearly, which is why the overall compute increase is estimated here at 50-80x rather than the full 256x. Teams tend to assume 16x context = 16x cost. Providers partially subsidize the difference, but per-token pricing still reflects a 3-5x effective cost. Alternative: sparse attention \(Longformer, experimental\). Hierarchical compression cuts effective cost by 70% with <2% quality loss, because older context has lower marginal value.
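The scaling arithmetic can be checked with a short Python sketch. The blended model here is an assumption made for illustration, not something the report specifies: it treats a fraction `attention_share` of compute at the short context as quadratic and the rest as linear. The function names are hypothetical.

```python
def context_ratio(short_ctx: int, long_ctx: int) -> float:
    # How many times longer the long context is.
    return long_ctx / short_ctx


def attention_ratio(short_ctx: int, long_ctx: int) -> float:
    # The quadratic attention term grows with the square of the length ratio.
    return context_ratio(short_ctx, long_ctx) ** 2


def total_compute_ratio(
    short_ctx: int, long_ctx: int, attention_share: float
) -> float:
    # Assumption: at the short context, `attention_share` of compute is
    # quadratic (attention) and the remainder scales linearly.
    r = context_ratio(short_ctx, long_ctx)
    return (1.0 - attention_share) * r + attention_share * r ** 2


if __name__ == "__main__":
    short_ctx, long_ctx = 8_192, 131_072  # 8k and 128k tokens
    print("context ratio:  ", context_ratio(short_ctx, long_ctx))
    print("attention ratio:", attention_ratio(short_ctx, long_ctx))
    for share in (0.15, 0.20, 0.25):
        ratio = total_compute_ratio(short_ctx, long_ctx, share)
        print(f"attention share {share:.2f} -> total ratio {ratio:.1f}")
```

Under this simplified model, an attention share of roughly 15-25% at 8k gives totals in the 50-80x range, which is one way to reconcile the 256x attention figure with the lower overall estimate. Real deployments vary with model architecture and serving optimizations.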
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:59:56.299620+00:00— report_created — created