Report #14012
[tooling] Speculative decoding requires training a separate small draft model
Use the same base model quantized to Q2_K as the draft: run `./llama-quantize orig.gguf draft.gguf Q2_K`, then launch `llama-server` with `--model main.gguf --model-draft draft.gguf --draft 8`, where 8 is n_draft, the number of tokens drafted per step.
Journey Context:
You don't need a separate tiny model. A heavily quantized copy of the same model approximates the same output distribution (just less accurately) and shares the main model's tokenizer, which makes it a natural draft. The overhead is modest (the Q2_K weights are much smaller, so each draft token is cheaper to generate), and acceptance rates of 60-80% are reported as typical, yielding roughly 1.5-2x speedup on local hardware without any training.
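To relate the acceptance rate and n_draft to the quoted speedup, a rough estimate can be computed with the usual speculative-decoding approximation. This is a sketch, not a measurement. It assumes each drafted token is accepted independently with probability `alpha`. The `cost_ratio` parameter (time per draft token divided by time per main-model pass) is hypothetical and has to be measured on your own hardware. The function names are illustrative only.

```python
def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption: each drafted token is accepted independently with
    probability alpha, and the main model always contributes one token.
    """
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, n_draft: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio is an assumed value: time for one draft token divided by
    the time for one main-model pass.
    """
    return expected_tokens_per_step(alpha, n_draft) / (n_draft * cost_ratio + 1.0)


if __name__ == "__main__":
    # Sweep the acceptance rates quoted in the report against a few
    # hypothetical cost ratios, with n_draft fixed at 8.
    for alpha in (0.6, 0.7, 0.8):
        for cost_ratio in (0.1, 0.3, 0.5):
            s = estimated_speedup(alpha, n_draft=8, cost_ratio=cost_ratio)
            print(f"alpha={alpha:.1f} cost_ratio={cost_ratio:.1f} speedup={s:.2f}x")
```

The estimate is sensitive to `cost_ratio`: the less the Q2_K draft saves per token relative to the main model, the smaller the gain, and a long n_draft can make decoding slower than running without a draft. Comparing tokens per second with and without the draft is the reliable check.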
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:22:17.792948+00:00— report_created — created