Report #77889
[cost\_intel] Long context pricing tiers and attention quadratic scaling hide non-linear costs
Shard documents into <4k token chunks and use cheap embedding retrieval \(text-embedding-3-small\) to select only the top-2 relevant chunks for the LLM context, keeping LLM context under 8k tokens even with 128k available.
Journey Context:
API pricing for 128k context models \(GPT-4 Turbo 128k\) is often 2-3x higher per token than 8k versions. Beyond pricing, attention mechanisms scale quadratically with sequence length \(O\(n²\)\), meaning 128k context consumes drastically more compute per token than 8k. The trap: dumping a full 100k document into context is 100x more expensive per useful answer than retrieving the specific 2k chunk needed. RAG with cheap embeddings \($0.02/1M tokens\) versus long-context LLM \($10/1M tokens\) is a 500x cost difference.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:19:49.800204+00:00— report_created — created