Report #90416
[tooling] Slow inference with speculative decoding in llama.cpp due to draft model bottleneck
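When using a draft model (`--draft draft_model.gguf`), add `-td 4 -nd 12` (or `-td 8` for CPU-only). Specifically: `-td` (threads-draft) sets the thread count for the draft model and should be a small number of physical cores (typically 4-8), while `-t` (main threads) handles the target model. Raise `-nd` to 16 to draft 16 tokens per batch. Check flag spellings against `--help` for your build: upstream llama.cpp passes the draft model path with `-md`/`--model-draft` and sets the draft length with `--draft`/`--draft-max`.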
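A minimal sketch of how the invocation described above could be assembled, for illustration only. The helper name `build_speculative_cmd`, the `llama-speculative` binary name, the file names, and the `-t 12` value are assumptions, not part of this report. The flag spellings `-md` and `--draft-max` follow upstream llama.cpp and differ from the `--draft`/`-nd` spellings written in this report; confirm them with `--help` on your build. Nothing is executed, the function only returns the argument list.

```python
import shlex


def build_speculative_cmd(target_model, draft_model, main_threads, draft_threads,
                          draft_tokens, binary="llama-speculative",
                          draft_len_flag="--draft-max"):
    """Return an argv list for a speculative-decoding run (hypothetical helper).

    binary and draft_len_flag are assumptions; adjust them to match the
    output of `--help` on the build actually in use.
    """
    if main_threads < 1 or draft_threads < 1 or draft_tokens < 1:
        raise ValueError("thread counts and draft length must be positive")
    return [
        binary,
        "-m", target_model,              # target (main) model
        "-md", draft_model,              # draft model path (upstream spelling)
        "-t", str(main_threads),         # threads for the target model
        "-td", str(draft_threads),       # threads for the draft model
        draft_len_flag, str(draft_tokens),  # tokens drafted per batch
    ]


cmd = build_speculative_cmd("target_model.gguf", "draft_model.gguf", 12, 4, 16)
print(shlex.join(cmd))
# llama-speculative -m target_model.gguf -md draft_model.gguf -t 12 -td 4 --draft-max 16
```

Keeping the draft-length flag as a parameter makes it easy to swap in whatever spelling a given build accepts.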
Journey Context:
Default settings use the same thread pool for target and draft models, causing contention. The draft model is memory-bandwidth bound, not compute bound, so it needs fewer threads (typically 4-8) to avoid stealing cycles from the main model. Common error: setting `-td` equal to `-t`, which can roughly halve throughput. Also, ensure the draft model uses the same tokenizer as the target model (verify with `llama-tokenize`). Tradeoff: a higher `-nd` increases VRAM usage for the draft KV cache and wastes draft work when later tokens are rejected, but lets more tokens be accepted per verification pass when the draft model tracks the target closely.
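The tokenizer check mentioned above could be scripted roughly as follows. This is a sketch under assumptions: it expects a `llama-tokenize` binary on `PATH` that accepts `-m`, `-p`, `--ids`, and `--log-disable` and prints the IDs as a Python-style list such as `[1, 2, 3]`; verify this with `llama-tokenize --help`. The helper names `parse_ids`, `tokenize_ids`, and `same_tokenization` are hypothetical. Matching IDs on a few prompts is a necessary check, not proof that the vocabularies are identical.

```python
import ast
import subprocess


def parse_ids(stdout):
    """Extract the token-ID list from llama-tokenize --ids output."""
    # Scan from the end so earlier log lines, if any, are ignored.
    for line in reversed(stdout.strip().splitlines()):
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            return list(ast.literal_eval(line))
    raise ValueError("no token-ID list found in output")


def tokenize_ids(model_path, prompt, binary="llama-tokenize"):
    """Run the tokenizer binary (assumed flags) and return the token IDs."""
    result = subprocess.run(
        [binary, "-m", model_path, "-p", prompt, "--ids", "--log-disable"],
        capture_output=True, text=True, check=True,
    )
    return parse_ids(result.stdout)


def same_tokenization(target_model, draft_model, prompts):
    """True if both models produce identical IDs for every sample prompt."""
    return all(
        tokenize_ids(target_model, p) == tokenize_ids(draft_model, p)
        for p in prompts
    )
```

Example call, with placeholder file names: `same_tokenization("target_model.gguf", "draft_model.gguf", ["Hello, world!", "def f(x):"])`. A `False` result means the pair should not be used together for speculative decoding.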
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:21:22.406318+00:00— report_created — created