Report #57500
[tooling] Speculative decoding in llama-server requires complex external draft models or fails to accelerate generation
Use llama-server's built-in speculative decoding: pass `--model` (the target, e.g., a 70B Q4 model), `--model-draft` (a small draft, e.g., a 1B Q4_0 model), and `--draft N` (N between 16 and 32). The server drafts N tokens with the small model and verifies them with the target model in a single batch, so no external orchestration code is needed. The reported speedup is 2-3x; the actual gain depends on how often the target model accepts the drafted tokens.
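As a minimal sketch of the invocation described above, the following Python helper assembles the command line. The flag names (`--model`, `--model-draft`, `--draft`) are taken from this report; the helper name `build_llama_server_cmd` and the model paths are hypothetical, and flag spellings may differ between llama.cpp builds, so check `llama-server --help` before relying on them.

```python
import shlex


def build_llama_server_cmd(target_path, draft_path, n_draft=16, binary="llama-server"):
    """Return an argv list that starts llama-server with a target and a draft model.

    Hypothetical helper: the flags come from this report and are not
    checked against any particular llama.cpp build.
    """
    if n_draft < 1:
        raise ValueError("n_draft must be at least 1")
    return [
        binary,
        "--model", target_path,        # large target model
        "--model-draft", draft_path,   # small draft model
        "--draft", str(n_draft),       # number of tokens drafted per step
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute real GGUF files.
    cmd = build_llama_server_cmd(
        "models/target-70b-q4.gguf",
        "models/draft-1b-q4_0.gguf",
        n_draft=16,
    )
    # Print a shell-quoted command line for inspection before running it.
    print(shlex.join(cmd))
```

Building the argument list separately from launching the process makes it easy to inspect the command, and the list can be passed directly to `subprocess.run` once the paths are real.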
Journey Context:
Speculative decoding (Leviathan et al.) traditionally requires multi-model serving infrastructure that runs a small draft model ahead of the target model. Users often try to implement this manually with two separate llama.cpp instances and run into race conditions and synchronization overhead. llama-server now includes native speculative decoding: the draft model runs in the same process and generates N speculative tokens, which are appended to the target model's batch and verified in parallel. The key is to set `--draft` high enough (16-32) to amortize the cost of running the draft model, but not so high that most drafted tokens are rejected and the drafting work is wasted. The tradeoff is somewhat higher VRAM usage, because two models are loaded; for a 70B model on a 48GB card, a 1B draft still fits and can roughly double throughput. This approach is distinct from prompt lookup decoding and from Medusa heads.
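The tradeoff in choosing the draft length can be made concrete with the estimate from Leviathan et al. This is a sketch under simplifying assumptions that are not measured in this report: each drafted token is accepted independently with probability `alpha`, verifying the whole batch costs about the same as one target pass, and `cost_ratio` is the time of one draft step relative to one target pass. The function names and the sample values of `alpha` and `cost_ratio` are illustrative only.

```python
def expected_tokens_per_pass(alpha, n_draft):
    """Expected tokens produced per target pass.

    Assumes each drafted token is accepted independently with
    probability alpha (an idealization of real acceptance behavior).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if n_draft < 1:
        raise ValueError("n_draft must be at least 1")
    if alpha == 1.0:
        # Every draft accepted, plus the one token the target always yields.
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, n_draft, cost_ratio):
    """Estimated speedup over plain decoding.

    cost_ratio is (time of one draft step) / (time of one target pass).
    Each iteration costs n_draft draft steps plus one target pass.
    """
    return expected_tokens_per_pass(alpha, n_draft) / (n_draft * cost_ratio + 1.0)


if __name__ == "__main__":
    alpha = 0.8        # assumed acceptance probability, not a measurement
    cost_ratio = 0.02  # assumed relative cost of a 1B draft vs. a 70B target
    for n in (4, 8, 16, 32):
        tokens = expected_tokens_per_pass(alpha, n)
        speedup = estimated_speedup(alpha, n, cost_ratio)
        print(f"--draft {n:>2}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x")
```

Under this model the expected tokens per pass saturate at `1 / (1 - alpha)` while the drafting cost keeps growing linearly with N, which is why raising `--draft` beyond a certain point stops helping. Measuring the real acceptance rate on a representative workload is the only reliable way to choose N.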
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:00:07.938659+00:00— report_created — created