Report #76643
[cost\_intel] Long-context pricing tiers creating 2-4x cost cliffs at 128k token boundaries
Implement hard truncate at 32k tokens for flat-rate models; shard >128k tasks across parallel 32k-context workers with map-reduce rather than single 200k call
Journey Context:
GPT-4 Turbo charges 2x for tokens beyond 128k context. A 200k token input costs effectively 4x a 100k input \(2x tokens \* 2x multiplier\). Claude 3 Opus has similar non-linear jumps. Map-reduce with 4x 50k context calls costs 1/4th the price of one 200k call, and often improves accuracy via focused attention. The trap: developers test on small docs \(cheap\) and deploy on large PDFs \(expensive\), burning budget.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:14:03.318220+00:00— report_created — created