Report #68468
[tooling] llama.cpp speculative decoding slow despite draft model
Use an extremely quantized draft model (Q2_K or IQ2_XXS) on the same GPU as the main model; draft latency matters more than draft quality.
Journey Context:
Users often pick a Q4_K_M draft model on the assumption that draft quality matters. In speculative decoding the main model verifies every drafted token, so a weaker draft does not degrade the output; it only lowers the fraction of drafted tokens that are accepted. The speedup comes mainly from how cheaply the draft proposes tokens, so the draft can be aggressively quantized (IQ2_XXS) and still provide a 1.5-2x speedup. Crucially, both models must be on the same GPU to avoid PCIe transfer bottlenecks. Many users load the draft on the CPU or on a separate GPU, which kills performance.
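To see why draft cost can outweigh draft accuracy, here is a rough sketch of the commonly used expected-speedup model for speculative decoding. It is not taken from this report or from llama.cpp: the names `alpha` (per-token acceptance rate), `gamma` (tokens drafted per step), and `cost_ratio` (time of one draft forward pass divided by one main-model pass) are assumptions, the model treats acceptances as independent, and all scenario numbers are hypothetical rather than measured.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized speedup of speculative decoding over plain decoding.

    alpha      -- assumed probability that a drafted token is accepted
    gamma      -- number of tokens the draft proposes per verification step
    cost_ratio -- draft pass time / main-model pass time (includes any
                  transfer overhead if the draft sits on another device)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")

    # Expected tokens produced per step: accepted draft tokens plus the one
    # token the main model always contributes during verification.
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost per step, in units of one main-model pass: gamma draft passes
    # plus a single verification pass.
    cost_per_step = gamma * cost_ratio + 1.0

    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Hypothetical (alpha, cost_ratio) pairs; none of these are benchmarks.
    scenarios = {
        "heavier draft, same GPU": (0.80, 0.15),
        "tiny draft, same GPU": (0.70, 0.03),
        "draft on CPU or other GPU": (0.80, 0.50),
    }
    for name, (alpha, cost_ratio) in scenarios.items():
        speedup = expected_speedup(alpha, gamma=4, cost_ratio=cost_ratio)
        print(f"{name}: {speedup:.2f}x")
```

Under these assumed numbers, the cheaper draft comes out ahead despite its lower acceptance rate, and a draft whose per-token cost is inflated by cross-device transfers gives almost no gain. Real acceptance rates and cost ratios depend on the model pair and hardware, so measure both before settling on a quantization.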
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:24:35.951626+00:00— report_created — created