Report #55336
[tooling] llama.cpp speculative decoding requires separate small draft model
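Use a smaller quantization of the same target model (e.g., Q4_0 draft for Q6_K main) via the `-md` flag, pointing it at the smaller GGUF, and tune the number of draft tokens with `--draft-max` (`--draft` in older builds) to 8-12 for balanced memory/speed. Note that `-ngld` sets how many draft-model layers are offloaded to the GPU, not the number of draft tokens.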
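As a rough illustration of the invocation described above, the sketch below assembles an argument list for a llama.cpp binary. It assumes a recent build in which `-m` and `-md` select the target and draft GGUFs, `-ngl` and `-ngld` set the GPU layer counts for each, and `--draft-max` sets the number of drafted tokens. The file names and the helper `build_speculative_cmd` are hypothetical; check `--help` on your build before relying on any flag.

```python
import shlex


def build_speculative_cmd(target_gguf, draft_gguf, draft_max=8,
                          gpu_layers=99, draft_gpu_layers=99,
                          binary="llama-server"):
    """Return an argv list for a llama.cpp run with a draft model.

    Flag names are assumptions based on recent llama.cpp builds.
    """
    return [
        binary,
        "-m", target_gguf,                # target (higher-precision quant)
        "-md", draft_gguf,                # draft (smaller quant of the same model)
        "-ngl", str(gpu_layers),          # GPU layers for the target
        "-ngld", str(draft_gpu_layers),   # GPU layers for the draft
        "--draft-max", str(draft_max),    # tokens drafted per step
    ]


# Hypothetical file names, for illustration only.
cmd = build_speculative_cmd("model-Q6_K.gguf", "model-Q4_0.gguf", draft_max=8)
print(shlex.join(cmd))
```

Building the list separately from running it makes it easy to sweep `draft_max` over 8-12 and compare throughput, for example by passing `cmd` to `subprocess.run`.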
Journey Context:
Conventional wisdom states that speculative decoding needs a separate small model (around 68M parameters, for example) distinct from the target. This complicates deployment and often harms acceptance rates because the draft's output distribution does not match the target's. The hard-won insight is that using a smaller quantization (e.g., Q4_0) of the exact same model as the draft for a larger quant (e.g., Q8_0 or Q6_K) yields high acceptance rates (>70%), because the two distributions differ only by quantization error and so align closely. It also avoids managing separate tokenizer vocabularies or architectures. The tradeoff is VRAM: both GGUFs must be resident at once, so the draft adds roughly its full file size to the memory budget. The reporter states that for 70B models on 48GB cards a Q4 draft fits comfortably and provides a 1.5-2x speedup.
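To reason about how acceptance rate and draft length interact, the sketch below uses the standard speculative-decoding estimate: with per-token acceptance probability `alpha` and `gamma` drafted tokens, the expected number of tokens produced per target pass is (1 - alpha^(gamma+1)) / (1 - alpha). It assumes acceptances are independent and identically distributed, which is a simplification. The `cost_ratio` parameter (draft time per token divided by target time per token) is not given in the report and has to be measured on your hardware; the values in the loop are placeholders.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target forward pass (i.i.d. acceptance)."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Estimated speedup over plain decoding.

    cost_ratio: draft time per token / target time per token (measured).
    Each step costs gamma draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# Placeholder inputs: 0.7 acceptance (from the report), assumed cost ratio.
for gamma in (4, 8, 12):
    print(gamma, round(estimated_speedup(0.7, gamma, cost_ratio=0.1), 2))
```

Under this model the gain depends heavily on `cost_ratio`: a draft that is not much cheaper per token than the target can push the estimate below 1, so it is worth measuring both models before settling on a draft length.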
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:22:23.602145+00:00— report_created — created