Report #8953
[tooling] Speculative decoding in llama.cpp requires finding a separate small draft model
Use a heavily quantized version of the same model (e.g., Q4_0) as the draft model for a higher-precision target model (e.g., Q6_K or F16), passing the target with `--model` (`-m`) and the quantized variant with `--model-draft` (`-md`).
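A minimal sketch of how such an invocation could be assembled. It assumes the `llama-speculative` example binary from a llama.cpp build is on the PATH and that the `-m`, `-md`, `-p`, and `-n` options are available in your version; binary names and flags have changed between releases, so check `--help` first. The GGUF file names are hypothetical placeholders.

```python
import shlex


def build_speculative_cmd(target_gguf, draft_gguf, prompt, n_predict=128,
                          binary="llama-speculative"):
    """Return an argv list pairing a target GGUF with a draft GGUF.

    `binary` defaults to the speculative-decoding example program; it is an
    assumption about the local llama.cpp build, not a guaranteed name.
    """
    return [
        binary,
        "-m", target_gguf,     # target (higher-precision) model
        "-md", draft_gguf,     # draft (more heavily quantized) model
        "-p", prompt,          # prompt text
        "-n", str(n_predict),  # number of tokens to generate
    ]


if __name__ == "__main__":
    # Hypothetical file names: two quantization levels of the same checkpoint.
    cmd = build_speculative_cmd(
        "model-Q6_K.gguf",
        "model-Q4_0.gguf",
        "Explain speculative decoding in one paragraph.",
    )
    # Print a copy-pastable shell command instead of executing it.
    print(shlex.join(cmd))
```

The script only prints the command so it can be inspected before anything is run.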
Journey Context:
The conventional approach to speculative decoding pairs a large target model with a much smaller draft model (e.g., a 7B draft for a 70B target). Finding a compatible draft model with a matching tokenizer and architecture is often difficult, though. The underused alternative is a form of self-speculation: using the same checkpoint at two quantization levels. The Q4_0 draft shares the exact tokenizer and architectural configuration of the target, so compatibility is not a concern. It is roughly 3.5x smaller than F16 weights (about 4.5 versus 16 bits per weight) and generates draft tokens correspondingly faster when decoding is memory-bound; against a Q6_K target (about 6.6 bits per weight) the gap narrows to roughly 1.5x, so the net speedup depends on how far apart the two quantization levels are. The acceptance rate reportedly remains high (often 60-80%) because both models derive from the same weights and produce closely matching output distributions. This eliminates the 'draft model hunt' entirely: you download two quantization levels of the same GGUF.
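Whether a given pair of quantization levels pays off can be estimated before downloading anything. The sketch below uses the standard simplified speculative-decoding model: each draft token is accepted independently with probability `alpha`, the draft proposes `gamma` tokens per step, and one draft step costs `cost_ratio` times one target step. Treating `cost_ratio` as the ratio of bits per weight is an assumption that only holds when decoding is memory-bandwidth-bound; all function names are illustrative.

```python
# Nominal bits per weight for a few GGUF formats.
BITS_PER_WEIGHT = {"F16": 16.0, "Q6_K": 6.5625, "Q4_0": 4.5}


def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens produced per target-model pass.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha` (a simplification of real behaviour).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def memory_bound_cost_ratio(draft, target):
    """Draft/target per-token cost, approximated by weight size (assumption)."""
    return BITS_PER_WEIGHT[draft] / BITS_PER_WEIGHT[target]


def estimated_speedup(alpha, gamma, cost_ratio):
    """Estimated speedup over plain decoding with the target model alone."""
    step_cost = gamma * cost_ratio + 1.0  # gamma draft steps + 1 target pass
    return expected_tokens_per_pass(alpha, gamma) / step_cost


if __name__ == "__main__":
    gamma = 4  # hypothetical draft length
    for target in ("F16", "Q6_K"):
        c = memory_bound_cost_ratio("Q4_0", target)
        for alpha in (0.6, 0.8):
            s = estimated_speedup(alpha, gamma, c)
            print(f"Q4_0 -> {target}: alpha={alpha}, cost_ratio={c:.2f}, "
                  f"estimated speedup={s:.2f}x")
```

Under these assumptions the estimate is noticeably better for a wide gap (Q4_0 drafting for F16) than for a narrow one (Q4_0 drafting for Q6_K), where it can fall below 1x. Real results depend on hardware, batch behaviour, and the prompt, so measure tokens per second with and without the draft model.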
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:50:18.543397+00:00— report_created — created