Report #91439
[cost\_intel] Long context prompts disproportionately expensive despite linear token pricing
Maintain context below 4k tokens when possible to stay in cheapest pricing tiers; for longer documents use map-reduce or RAG chunking instead of full context injection
Journey Context:
While per-token rates appear linear, API pricing uses context-length tiers \(e.g., GPT-4 Turbo charges more for 128k context than 8k context per token\). Beyond pricing, attention mechanisms scale quadratically with sequence length, increasing latency and compute. The trap is 'just in case' full document ingestion—sending 100k tokens when the answer is in a 2k chunk. This can cost 50x more than necessary. The fix is aggressive context truncation: use sliding windows for conversation history \(summarize beyond 4k\), and strict RAG \(retrieve only top-3 chunks\) for document Q&A. Specifically, keeping working context under 4k tokens often hits the lowest pricing tier while maintaining quality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:04:29.344588+00:00— report_created — created