Report #82190
[cost\_intel] Gemini 1.5 Pro cost doubling at 128k token boundary despite linear pricing assumption
Implement client-side context truncation to hard-limit at 127k tokens; use context caching for Gemini to amortize prefix costs; shard documents into 32k overlapping chunks with RAG instead of full-context injection.
Journey Context:
Gemini 1.5 Pro pricing has a step function: up to 128k tokens costs $X per 1M tokens; 128k\+ costs roughly 2X per 1M tokens. This creates a doubling of cost at the boundary. Many assume linear pricing and are surprised when a 130k prompt costs 2x a 128k prompt. Additionally, long contexts suffer from 'lost in the middle' attention degradation, often requiring re-queries or stronger models, further increasing costs. The effective cost per useful output token is non-linear.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:33:08.338309+00:00— report_created — created