Report #5435
[tooling] How do I set up speculative decoding in llama.cpp server to accelerate 70B models on CPU?
Start `llama-server` with your main 70B model passed via `-m`, a small draft model (e.g., TinyLlama-1.1B-Q4_0) passed via `-md`, and a context size such as `-c 4096`. The server then applies speculative decoding to incoming requests automatically, typically achieving a 1.5-2x speedup on CPU-bound generation.
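As a rough illustration of the invocation described above, the following sketch assembles the command line without launching anything. The model paths and the `build_server_command` helper are hypothetical, and the optional `--draft-max` flag is assumed to be present in your build; check `llama-server --help` before relying on it.

```python
import shlex


def build_server_command(main_model, draft_model, ctx_size=4096, port=8080,
                         draft_max=None):
    """Assemble a llama-server argv list with a draft model attached.

    Hypothetical helper: it only builds the argument list and starts nothing.
    """
    cmd = [
        "llama-server",
        "-m", main_model,      # main (target) model, e.g. a 70B GGUF
        "-md", draft_model,    # draft model used for speculation
        "-c", str(ctx_size),   # context size
        "--port", str(port),
    ]
    if draft_max is not None:
        # Assumption: the installed build accepts --draft-max (tokens drafted per step).
        cmd += ["--draft-max", str(draft_max)]
    return cmd


# Placeholder paths; substitute your own GGUF files.
command = build_server_command(
    "models/llama-2-70b.Q4_K_M.gguf",
    "models/tinyllama-1.1b.Q4_0.gguf",
)
print(shlex.join(command))
```

Once the printed command looks right, it can be pasted into a shell or passed to `subprocess.run(command)`.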
Journey Context:
Many assume speculative decoding requires complex client-side logic or matching model architectures. llama.cpp's server (`llama-server`) has speculative decoding built in: you simply provide a second GGUF via `-md` (model draft), and the server handles drafting and verification internally. Critical: the draft model must use the same tokenizer as the main model, or a very similar one; otherwise most drafted tokens are rejected and the drafting work is wasted. For a 70B model on CPU, a 1B Q4_0 draft adds minimal memory and can yield 2-3 accepted tokens per verification step. Common pitfall: choosing a draft that is too large (e.g., 7B), which competes with the main model for memory and cache bandwidth and can slow generation down overall. An alternative is lookahead decoding, which is implemented in vLLM but is not available in `llama-server`.
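The trade-off between draft size and speedup can be sketched with the standard back-of-the-envelope model for speculative decoding. This is a simplified estimate, not a measurement. It assumes each drafted token is accepted independently with probability `accept_rate`, and that verifying a batch of drafted tokens costs about one main-model pass, which may not hold on a compute-bound CPU. The function names and example numbers are illustrative only.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per verification step.

    Uses the geometric-series estimate (1 - a^(k+1)) / (1 - a), which
    includes the one token the main model always contributes.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate, draft_len, cost_ratio):
    """Rough speedup over plain decoding.

    cost_ratio is the draft model's per-token cost relative to the main
    model (assumed here to scale roughly with parameter count).
    """
    step_cost = draft_len * cost_ratio + 1.0  # draft passes + one verify pass
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


# Illustrative inputs: 60% acceptance, 4 drafted tokens per step.
small_draft = estimated_speedup(0.6, 4, cost_ratio=1 / 70)  # ~1B draft vs 70B
large_draft = estimated_speedup(0.6, 4, cost_ratio=7 / 70)  # ~7B draft vs 70B
print(f"tokens per step: {expected_tokens_per_step(0.6, 4):.2f}")
print(f"small draft: {small_draft:.2f}x, large draft: {large_draft:.2f}x")
```

At equal acceptance rates the larger draft always comes out behind in this model, so a 7B draft only pays off if it raises the acceptance rate enough to cover its extra cost.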
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:16:58.174624+00:00— report_created — created