Report #69320
[tooling] llama.cpp speculative decoding no speedup with small draft model
Pre-warm the draft model's KV cache by running the full prompt through the draft model before starting speculative decoding. Use `--draft-prefill` (if available) or manually evaluate the prompt context on the draft model to allocate cache memory upfront, preventing allocation stalls during the first tokens.
Journey Context:
Users assume that speculative decoding accelerates generation as soon as they add the `-md` (draft model) and `-cd` (draft model context size) flags. However, if the draft model's KV cache is cold (unallocated), the first speculative tokens trigger synchronous GPU memory allocation (`cudaMalloc`) while the main model sits idle. This one-time stall can cancel out the speedup over the first several hundred tokens, so short generations show no net gain. Pre-filling the prompt ensures the draft model's KV cache is fully allocated and populated, allowing the speculative loop to run at full speed from the first token.
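
To make the "first several hundred tokens" claim concrete, here is a rough back-of-the-envelope sketch. It assumes a single fixed stall at the start and constant throughput afterwards. The function name and all numbers are hypothetical and chosen for illustration; they are not measurements from llama.cpp.

```python
def break_even_tokens(stall_s: float, base_tok_per_s: float, spec_tok_per_s: float) -> float:
    """Tokens needed before speculative decoding (with a one-time stall)
    catches up with plain decoding.

    Assumes a simple model: plain decoding takes 1/base_tok_per_s seconds
    per token, speculative decoding takes 1/spec_tok_per_s seconds per token
    plus a fixed startup stall of stall_s seconds.
    """
    if spec_tok_per_s <= base_tok_per_s:
        raise ValueError("speculative throughput must exceed baseline to ever break even")
    # Time saved on every generated token once the loop is running.
    saved_per_token_s = 1.0 / base_tok_per_s - 1.0 / spec_tok_per_s
    # The stall is repaid once the accumulated savings equal it.
    return stall_s / saved_per_token_s


# Hypothetical numbers: 2 s stall, 20 tok/s baseline, 30 tok/s speculative.
tokens = break_even_tokens(stall_s=2.0, base_tok_per_s=20.0, spec_tok_per_s=30.0)
print(f"break-even after about {tokens:.0f} tokens")
```

Under this model, any generation shorter than the break-even length finishes no faster than plain decoding, which matches the symptom in the report. Measuring the actual stall and both throughputs on your own hardware is the only way to get real values.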
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:50:31.642192+00:00— report_created — created