Report #472
[tooling] llama.cpp speculative decoding with a draft model is slower or OOMs
Use a same-family draft model with the identical tokenizer vocabulary, 5–15× smaller than the target. Offload both models fully with `-ngl 99 -ngld 99`, set `--draft-max` to 8–16 and `--draft-min` to 4–8, then read the acceptance rate in the server log. If acceptance is below ~60%, speculative decoding is a net loss, so disable it.
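A minimal sketch of how the launch command above could be assembled, in case it helps to script it. The model paths are hypothetical placeholders, and the helper name `build_server_command` is made up for illustration; `-m`, `-md`, `-ngl`, `-ngld`, `--draft-max` and `--draft-min` are the `llama-server` flags as I understand them, so check `llama-server --help` for your build, since flag names have changed between versions.

```python
import shlex


def build_server_command(target_path, draft_path,
                         draft_max=8, draft_min=4, gpu_layers=99):
    """Return an argv list for llama-server with a draft model.

    Paths and defaults are illustrative assumptions, not recommendations
    from llama.cpp itself.
    """
    return [
        "llama-server",
        "-m", target_path,            # target (large) model
        "-md", draft_path,            # draft (small) model
        "-ngl", str(gpu_layers),      # offload target layers to GPU
        "-ngld", str(gpu_layers),     # offload draft layers to GPU
        "--draft-max", str(draft_max),
        "--draft-min", str(draft_min),
    ]


# Hypothetical file names; substitute your own GGUF files.
cmd = build_server_command("models/target.gguf", "models/draft.gguf")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd)` once the paths point at real files; the printed form is only for copying into a shell.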
Journey Context:
The obvious move, pairing a 70B model with a 3B draft, often fails because the draft model's vocabulary and tokenizer must match the target's exactly, or the server errors out. Many users also skip offloading the draft model (`-ngld`), which leaves drafting on the CPU, or offload both models without enough VRAM headroom and run out of memory. Setting `--draft-max` too high is another common mistake: longer drafts are rejected more often, so the extra drafting compute is wasted. The sweet spot is a small, same-family drafter fully resident on GPU. The acceptance rate in the log is the only signal that matters, because a 50% accept rate usually means no wall-clock speedup.
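To see why a 50% accept rate rarely pays off, here is a rough back-of-the-envelope model, not a measurement. It assumes each drafted token is accepted independently with probability `alpha` (a simplification: the aggregate rate in the server log is not exactly this quantity), that verifying a whole draft costs about one target forward pass, and that `cost_ratio` is the draft model's per-token time divided by the target's. The function names and the `0.1` cost ratio are illustrative assumptions.

```python
def expected_tokens_per_step(alpha, draft_len):
    """Expected tokens produced per target verification pass.

    Assumes independent per-token acceptance with probability alpha;
    the target always contributes one token, hence the +1 in the exponent.
    """
    if alpha >= 1.0:
        return draft_len + 1.0
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, draft_len, cost_ratio):
    """Estimated speedup over plain decoding under the assumptions above.

    One step costs draft_len draft passes plus one target pass,
    measured in units of a single target pass.
    """
    step_cost = draft_len * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, draft_len) / step_cost


# Illustrative comparison; cost_ratio=0.1 is an assumed value.
for alpha in (0.5, 0.8):
    for draft_len in (8, 16):
        s = estimated_speedup(alpha, draft_len, cost_ratio=0.1)
        print(f"alpha={alpha} draft_len={draft_len} speedup~{s:.2f}x")
```

Under these assumptions, an acceptance probability of 0.5 yields only about two tokens per step no matter how long the draft is, so raising the draft length just adds drafting cost; at 0.8 the same draft length roughly doubles throughput. Real numbers depend on hardware and the model pair, so treat the log's acceptance rate and measured tokens per second as the ground truth.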
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:53:23.790886+00:00— report_created — created