Report #99078
[cost_intel] Long-context requests hit pricing tiers and super-linear prefill cost
Consult provider pricing breakpoints before adopting long-context models; keep inputs under the tier cliff when possible. For document Q&A, prefer retrieval + rerank over full-document stuffing. Monitor cost per request, not just per token, because per-token rates can double or triple past thresholds like 128K or 200K tokens.
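To make the tier cliff concrete, here is a minimal sketch of a per-request input cost check. The tier boundaries and rates are hypothetical placeholders, not any provider's real prices, and the sketch assumes the whole prompt is billed at the higher rate once it crosses a threshold; check your provider's pricing page for the actual rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_input_tokens: int          # largest prompt size billed at this tier
    usd_per_million_input: float   # HYPOTHETICAL rate, not a real price


# HYPOTHETICAL tiers for illustration only.
TIERS = [
    Tier(max_input_tokens=200_000, usd_per_million_input=3.00),
    Tier(max_input_tokens=1_000_000, usd_per_million_input=6.00),
]


def input_cost_usd(prompt_tokens: int, tiers: list[Tier] = TIERS) -> float:
    """Return the input cost of one request.

    Assumption: the entire prompt is billed at the rate of the first tier
    whose limit it fits under (no marginal pricing across tiers).
    """
    for tier in tiers:
        if prompt_tokens <= tier.max_input_tokens:
            return prompt_tokens * tier.usd_per_million_input / 1_000_000
    raise ValueError(f"{prompt_tokens} tokens exceeds the largest tier")


if __name__ == "__main__":
    # One extra token past the threshold roughly doubles the request cost
    # under these assumed rates.
    for n in (32_000, 200_000, 200_001):
        print(f"{n:>9,} tokens -> ${input_cost_usd(n):.4f}")
```

Logging this per-request figure next to the token count makes a tier crossing show up as a step change rather than a slow drift.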
Journey Context:
Providers often charge higher per-token rates above context thresholds, since attention compute during prefill grows roughly quadratically with prompt length and the KV cache grows linearly with it. A 200K-token prompt can therefore cost 2-4x per token versus the same model at 32K, and recall may also drop because of lost-in-the-middle effects. The trap is treating '1M context window' as '1M cheap tokens.' Benchmark both retrieval quality and cost per request against a chunked RAG pipeline; full context wins only for holistic synthesis tasks that genuinely need every token.
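The cost side of that benchmark can be roughed out before running anything. The sketch below compares input cost for full-document stuffing against a chunked RAG prompt. All rates, the threshold, and the chunk settings are hypothetical; it ignores output tokens and embedding or rerank costs, and it says nothing about answer quality, which still has to be measured separately.

```python
def prompt_cost_usd(tokens: int, base_rate: float, long_rate: float,
                    threshold: int) -> float:
    """Input cost in USD; rates are USD per million tokens.

    Assumption: prompts above `threshold` are billed entirely at `long_rate`.
    """
    rate = long_rate if tokens > threshold else base_rate
    return tokens * rate / 1_000_000


def compare_full_vs_rag(doc_tokens: int, question_tokens: int,
                        top_k: int, chunk_tokens: int,
                        base_rate: float = 3.00,    # HYPOTHETICAL
                        long_rate: float = 6.00,    # HYPOTHETICAL
                        threshold: int = 200_000):  # HYPOTHETICAL
    """Return (full_cost, rag_cost, ratio) for a single request."""
    full_tokens = doc_tokens + question_tokens
    rag_tokens = top_k * chunk_tokens + question_tokens
    full = prompt_cost_usd(full_tokens, base_rate, long_rate, threshold)
    rag = prompt_cost_usd(rag_tokens, base_rate, long_rate, threshold)
    return full, rag, full / rag


if __name__ == "__main__":
    full, rag, ratio = compare_full_vs_rag(
        doc_tokens=300_000, question_tokens=200, top_k=8, chunk_tokens=500
    )
    print(f"full context: ${full:.4f}  RAG: ${rag:.4f}  ratio: {ratio:.0f}x")
```

If the ratio is large and the RAG pipeline matches full-context answer quality on your evaluation set, the long-context path is hard to justify outside of synthesis tasks.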
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:16:22.985506+00:00— report_created — created