Report #6710
[tooling] Multi-turn agent workflows with vLLM re-prefill the entire conversation history on every turn, causing high time-to-first-token latency on later turns
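Enable `--enable-prefix-caching` (automatic prefix caching, APC) so that vLLM caches the KV blocks of prefixes shared across turns. On subsequent turns, time to first token (TTFT) then scales with the newly appended tokens rather than the full history. APC speeds up the prefill phase only; per-token decode time (TPOT) is not reduced.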
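A minimal sketch of enabling APC through vLLM's offline Python API, assuming the `vllm` package is installed and a GPU is available. The model name, the system prompt, and the helper names `build_turn_prompt` and `run_agent_turns` are placeholders for illustration and do not come from this report. When running the server instead, the equivalent is the `--enable-prefix-caching` flag. Some vLLM versions may already enable APC by default, so check the defaults for your release.

```python
from typing import List

# Hypothetical system prompt; keep it byte-identical across turns.
SYSTEM_PROMPT = "You are a tool-calling agent. Reply with JSON only.\n"


def build_turn_prompt(history: List[str], new_message: str) -> str:
    """Append the new message to the unchanged history, so each turn's
    prompt begins with the previous turn's prompt (a cacheable prefix)."""
    return SYSTEM_PROMPT + "".join(history) + new_message


def run_agent_turns(messages: List[str], model: str = "facebook/opt-125m") -> List[str]:
    """Run turns sequentially on one engine with APC enabled.
    The model name is a placeholder; this needs vllm and a GPU."""
    from vllm import LLM, SamplingParams  # imported lazily: heavy dependency

    llm = LLM(model=model, enable_prefix_caching=True)
    params = SamplingParams(temperature=0.0, max_tokens=64)

    history: List[str] = []
    replies: List[str] = []
    for msg in messages:
        prompt = build_turn_prompt(history, msg)
        reply = llm.generate([prompt], params)[0].outputs[0].text
        # Only append; rewriting earlier history would change the prefix.
        history.extend([msg, reply])
        replies.append(reply)
    return replies
```

The engine is created once and reused for every turn: the cache lives inside the engine instance, so a new `LLM` object per turn would start with an empty cache.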
Journey Context:
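In agentic workflows, every turn appends new tokens to a growing conversation history. Without prefix caching, vLLM treats each turn as a fresh request and recomputes the prefill for the entire history (which could be 8k-32k tokens) before it can emit the first token, so later turns start more and more slowly. Within a single request the KV cache is already reused across decode steps; the waste is in repeating the prefill on every turn. Users often wrongly blame the model size or try manual prompt compression. APC detects that the new input shares a prefix with KV blocks that are already cached and reuses them, computing only the new suffix. This is crucial for function-calling agents. Tradeoffs: cached blocks stay resident in the KV-cache pool until they are evicted, so the benefit shrinks under memory pressure, and reuse requires the prefix to match token for token. Caching is therefore most effective when the prompt structure is consistent (templated JSON) and nothing early in the prompt, such as a timestamp in the system prompt, changes between turns.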
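The prefix-reuse behaviour described above can be illustrated with a toy model. This is a simplified sketch, not vLLM's actual implementation: the class `ToyPrefixCache`, the block size of 16, and the chained-hash scheme are assumptions chosen for illustration. It counts how many tokens would need prefill on each turn.

```python
from typing import List, Set

BLOCK_SIZE = 16  # assumed block size, for illustration only


class ToyPrefixCache:
    """Toy block-level prefix cache: a block is reusable only if it and
    every block before it match a previously seen sequence."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()

    def _block_keys(self, tokens: List[int]) -> List[int]:
        # Chain each block's key to its parent's key, so a key
        # identifies the whole prefix up to and including that block.
        keys: List[int] = []
        parent = 0
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # drop partial tail block
        for start in range(0, full, BLOCK_SIZE):
            parent = hash((parent, tuple(tokens[start:start + BLOCK_SIZE])))
            keys.append(parent)
        return keys

    def prefill(self, tokens: List[int]) -> int:
        """Return the number of tokens that must be computed, then
        record all full blocks of this sequence as cached."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self._seen:
                break  # first miss ends the reusable prefix
            hits += 1
        self._seen.update(keys)
        return len(tokens) - hits * BLOCK_SIZE


cache = ToyPrefixCache()
turn1 = list(range(8000))                    # 8k-token history
turn2 = turn1 + list(range(8000, 8200))      # same history plus 200 new tokens
turn3 = [-1] + turn2[1:]                     # first token changed

print(cache.prefill(turn1))  # 8000: cold cache, everything is computed
print(cache.prefill(turn2))  # 200: only the new suffix is computed
print(cache.prefill(turn3))  # 8200: an early change invalidates the whole prefix
```

The third call shows why a consistent prompt template matters: a single differing token at the start leaves nothing to reuse, even though the rest of the sequence is identical.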
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:45:43.058410+00:00— report_created — created