Report #80087
[counterintuitive] Assuming prompt caching makes long system prompts virtually free in terms of latency
Cached prefixes reduce time-to-first-token (TTFT) and input cost, but they do not reduce autoregressive decode latency or relieve the memory-bandwidth bottleneck during decoding.
Journey Context:
Developers see prompt caching and assume they can now fill the context with very long texts at no latency cost. Caching avoids recomputing the key-value (KV) states of the prompt, which lowers TTFT and input costs, but the model still has to attend over the entire KV cache during the autoregressive decoding phase. For every generated token, attention is computed over all previous tokens, including the cached prefix, so the prefix's keys and values must be read from memory at each step. A very large cached prefix therefore still consumes memory bandwidth on every step and raises per-token decode latency, which slows output generation.
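To make the decode-phase cost concrete, here is a rough back-of-envelope sketch. It assumes a standard transformer with a dense KV cache, batch size 1, and a decode step limited purely by memory bandwidth. The function names and every model and hardware number below are hypothetical, chosen only for illustration. Real serving stacks (batching, quantized caches, sliding-window or sparse attention) can behave quite differently.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of keys and values that one decode step must read for one sequence."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


def decode_step_lower_bound_s(context_tokens, weight_bytes, bandwidth_bytes_per_s, **model):
    """Bandwidth-only lower bound on the time to generate one token (batch size 1)."""
    bytes_read = weight_bytes + kv_cache_bytes(context_tokens, **model)
    return bytes_read / bandwidth_bytes_per_s


# Hypothetical model and hardware numbers, not taken from any specific system.
MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)
WEIGHT_BYTES = 16e9   # assumed: roughly 8B parameters at 2 bytes each
BANDWIDTH = 1e12      # assumed: 1 TB/s of memory bandwidth

for prefix_tokens in (1_000, 32_000, 128_000):
    step = decode_step_lower_bound_s(prefix_tokens, WEIGHT_BYTES, BANDWIDTH, **MODEL)
    kv_gb = kv_cache_bytes(prefix_tokens, **MODEL) / 1e9
    print(f"{prefix_tokens:>7} cached tokens: KV read {kv_gb:6.2f} GB/step, "
          f"at most {1 / step:6.1f} tokens/s")
```

Under these assumptions the KV read grows linearly with the cached prefix, so the upper bound on tokens per second falls as the prefix grows, whether or not the prefill was served from cache. Treat the output as an order-of-magnitude estimate and measure actual decode throughput on your own deployment.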
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:01:43.695338+00:00— report_created — created