Report #49999
[tooling] Slow inference on single GPU without separate draft model for speculative decoding
Use self-speculative decoding: point -md (--model-draft) at a more aggressively quantized copy of the same model (e.g., a Q4_0 draft for a Q8_0 main model). This removes the need for a separate small draft model such as a 7B. Because the draft shares the main model's weight distribution, acceptance rates are reported to be high (roughly 60-80%), with a reported 1.5-2x speedup on a single GPU. Note that both quantized files still have to fit in memory at the same time.
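As a rough sanity check on the numbers above, the sketch below estimates the speedup from an acceptance rate and a draft length. It uses the usual geometric estimate and rests on simplifying assumptions that are not from this report: each drafted token is accepted independently with the same probability, and one batched verification pass costs about as much as one normal decoding step. The function names and the draft_cost parameter are hypothetical, chosen only for illustration.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate (a simplification).
    """
    if accept_rate >= 1.0:
        # Every draft token is accepted, plus one token from the main model.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is the time of one draft forward pass relative to one
    main-model pass (hypothetical parameter; measure it on your hardware).
    """
    # One round = draft_len draft passes + one main-model verification pass.
    round_cost = draft_len * draft_cost + 1.0
    return expected_tokens_per_step(accept_rate, draft_len) / round_cost


if __name__ == "__main__":
    for cost in (0.1, 0.3, 0.5):
        print(cost, round(estimated_speedup(0.7, 4, cost), 2))
```

The estimate is very sensitive to draft_cost: a draft that is only modestly faster than the main model can cancel the gain from a high acceptance rate, so it is worth timing the draft on your own hardware before settling on a draft length.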
Journey Context:
Standard speculative decoding requires a small, fast draft model (e.g., 7B) to predict tokens for a large main model (e.g., 70B). On consumer GPUs, fitting both in VRAM is often not possible. The 'self-speculative' or 'self-draft' approach, merged in llama.cpp PR #6476, uses the same model for both roles: the draft runs at a lower quantization (e.g., Q4_0), or even at the same quantization but with a smaller draft context window, and generates 4-8 candidate tokens. The main model (Q8_0 or F16) then verifies all candidates in a single parallel pass. Because the draft is derived from the same distribution, acceptance rates are high (60-80%), unlike with a mismatched small model (e.g., TinyLlama drafting for CodeLlama). The technique is underused because the documentation focuses on separate draft models, and users assume -md requires a different architecture.
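The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of greedy speculative decoding in general, not llama.cpp's implementation: the NextToken interface, toy_main, and toy_draft are hypothetical stand-ins, and verification is written as a sequential loop where a real engine would score all candidates in one batched pass.

```python
from typing import Callable, List

# Hypothetical interface: a greedy next-token function over a token list.
NextToken = Callable[[List[int]], int]


def speculative_step(main: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The main model checks each position in order (sequential here
    #    only to keep the sketch small).
    accepted: List[int] = []
    for tok in proposed:
        target = main(context + accepted)
        if target != tok:
            # First mismatch: keep the main model's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(tok)

    # 3. All k drafts matched, so the main model adds one bonus token.
    accepted.append(main(context + accepted))
    return accepted


def generate(main: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after prompt using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(main, draft, out, k))
    return out[: len(prompt) + n_tokens]


def toy_main(ctx: List[int]) -> int:
    """Toy 'main model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy 'draft model': agrees with toy_main except at some positions."""
    return 0 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 10
```

With greedy sampling the output matches what the main model would produce on its own; the draft only changes how many main-model passes are needed, which is why a closely matched draft (higher acceptance) translates into speed rather than different text.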
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:24:30.737514+00:00— report_created — created