Report #47960
[tooling] Speculative decoding speedup is minimal or negative when the draft model uses the same quantization as the target
Use an aggressively quantized draft model (IQ2_XXS or Q2_K) with a higher-quality target model (Q4_K_M or Q5_K_M) in llama.cpp speculative decoding. Set `-cd 512` (`--ctx-size-draft`, the draft model's context size) and `-td 4` (`--threads-draft`, the draft model's thread count) to maximize throughput.
Journey Context:
The draft model runs once for every drafted token, so its speed matters more than its quality. A Q2_K 7B draft is ~3x faster than a Q4_K_M 7B draft while maintaining an acceptance rate of 85% or higher against strong target models. The cost of rejected tokens is small compared to the speed gain. A common mistake is to use a draft with the same quantization as the target, or a 70B draft; both are too slow to pay off. The `-cd` flag (`--ctx-size-draft`) sets the context size allocated for the draft model, and `-td` (`--threads-draft`) sets the number of threads it uses during generation.
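To make the speed-versus-quality trade-off concrete, here is a rough sketch of the standard expected-speedup estimate for speculative decoding. It assumes each drafted token is accepted independently with the same probability, and it ignores sampling and batching overhead. The function name `expected_speedup` and the numbers in the example (cost ratios, acceptance rates, draft length) are illustrative assumptions, not measurements from this report.

```python
def expected_speedup(accept_rate: float, cost_ratio: float, draft_len: int) -> float:
    """Estimate speculative decoding speedup over plain target-only decoding.

    accept_rate: assumed probability that a drafted token is accepted (0..1).
    cost_ratio:  assumed time of one draft step divided by one target step.
    draft_len:   number of tokens drafted before each target verification.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if cost_ratio < 0.0 or draft_len < 1:
        raise ValueError("cost_ratio must be >= 0 and draft_len >= 1")

    # Expected tokens produced per verification round (geometric series).
    if accept_rate == 1.0:
        tokens_per_round = draft_len + 1
    else:
        tokens_per_round = (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

    # Cost of one round, in units of a single target forward pass:
    # draft_len draft steps plus one target verification pass.
    cost_per_round = draft_len * cost_ratio + 1
    return tokens_per_round / cost_per_round


if __name__ == "__main__":
    # Hypothetical settings: a slower, more accurate draft versus a draft that
    # is 3x faster but accepted slightly less often.
    slow_draft = expected_speedup(accept_rate=0.90, cost_ratio=0.30, draft_len=8)
    fast_draft = expected_speedup(accept_rate=0.85, cost_ratio=0.10, draft_len=8)
    # Hypothetical worst case: a draft that costs as much per step as the target.
    heavy_draft = expected_speedup(accept_rate=0.95, cost_ratio=1.00, draft_len=8)
    print(f"slow draft:  {slow_draft:.2f}x")
    print(f"fast draft:  {fast_draft:.2f}x")
    print(f"heavy draft: {heavy_draft:.2f}x")
```

Under these assumed numbers, the faster draft comes out ahead despite its lower acceptance rate, and a draft that costs as much as the target drops below 1x. Real results depend on hardware, model pair, and prompt, so measure tokens per second on your own setup before settling on a configuration.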
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:58:56.812002+00:00— report_created — created