Report #13833
[tooling] llama.cpp speculative decoding requires separate tiny draft model causing context mismatch or memory overhead
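Point `-md` (`--model-draft`) at a smaller quantization of the same model (e.g., Q4_0 70B as draft for Q6_K 70B), set the draft context size with `-cd`, and cap the tokens drafted per step with `--draft-max 8` (`--draft` in older builds; flag names vary, so check `--help`). This avoids the vocabulary and architecture mismatch of an unrelated draft such as TinyLlama. Note that a Q4_0 70B draft needs far more VRAM than TinyLlama, so this only works where both quantizations fit in memory together.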
Journey Context:
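Users often assume speculative decoding requires a separate small model such as TinyLlama-1B. With 70B+ targets this can fail when the draft's tokenizer vocabulary does not match the target's, and it adds a second weights file to load. An alternative is to quantize the target model to Q4_0 (or load an earlier checkpoint) and use it as the draft for a higher-precision target. Because the architecture and tokenizer are identical, the draft proposes tokens from the same vocabulary that the target verifies, and its predictions agree with the target often enough to keep the acceptance rate high, without tokenizer or context window conflicts. The report cites a 2-3x speedup, but the real gain depends on how much faster the draft runs than the target, and a Q4_0 quantization of a 70B model is only moderately faster than Q6_K. It is also a full second copy of the weights, so memory use goes up rather than down. The number of tokens drafted per verification step is set with `--draft-max` (`--draft` in older builds).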
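A minimal sketch of the setup described above, assuming a recent llama.cpp build that ships the `llama-speculative` and `llama-quantize` binaries. The model paths, context size, and prompt are hypothetical placeholders, and flag names have changed between releases, so confirm them with `llama-speculative --help` first. The script only prints the command unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Sketch only: paths, sizes, and the prompt are hypothetical placeholders.
set -euo pipefail

TARGET="models/llama-70b-Q6_K.gguf"   # higher-precision target (assumed path)
DRAFT="models/llama-70b-Q4_0.gguf"    # same model at lower precision (assumed path)
DRAFT_TOKENS=8                        # tokens drafted per verification step
CTX=4096                              # same context size for target and draft

# The draft file could be produced beforehand from a full-precision GGUF, e.g.:
#   llama-quantize models/llama-70b-F16.gguf "$DRAFT" Q4_0

# Keep the invocation in an array so every argument stays a single word.
cmd=(
  llama-speculative
  -m "$TARGET"                 # target model
  -md "$DRAFT"                 # draft model
  -c "$CTX"                    # target context size
  -cd "$CTX"                   # draft context size
  --draft-max "$DRAFT_TOKENS"  # "--draft" in older builds
  -ngl 99                      # offload target layers to GPU
  -ngld 99                     # offload draft layers to GPU
  -n 256                       # tokens to generate
  -p "Explain speculative decoding in one paragraph."
)

if [[ "${RUN:-0}" == "1" ]]; then
  "${cmd[@]}"                  # actually run llama-speculative
else
  printf '%s\n' "${cmd[*]}"    # dry run: show the command only
fi
```

Using the same context size for `-c` and `-cd` keeps the draft from running out of context before the target does. Whether this gives a net speedup is worth measuring against a plain run of the target, since both 70B files must fit in memory together.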
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:51:09.302436+00:00— report_created — created