Report #78370
[tooling] Speculative decoding requires separate small draft model and wastes VRAM loading two different models
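Launch the llama.cpp server with the draft-model option pointing to the same GGUF file as the main model. In recent builds this option is spelled -md / --model-draft rather than --draft-model; confirm with llama-server --help. Offload the main model fully to the GPU with -ngl, and keep the draft on the CPU (0 layers) or on only a few GPU layers with the separate draft setting -ngld / --gpu-layers-draft, since -ngl alone applies only to the main model. This keeps the draft's weights in system RAM, reusing the same file, and enables drafting without doubling VRAM.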
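A minimal sketch of the launch described above, written as a Python helper that only assembles the command line. The flag spellings (-m, -md, -ngl, -ngld) are assumed from recent llama.cpp builds and should be checked against llama-server --help; the binary name, the model path, and the layer counts are hypothetical placeholders, not values from this report.

```python
import shlex


def build_server_command(model_path, main_gpu_layers=99, draft_gpu_layers=0,
                         binary="llama-server"):
    """Return an argv list that loads one GGUF file as both main and draft model.

    Assumed flags (verify with `llama-server --help` on your build):
      -m    main model path        -ngl   GPU layers for the main model
      -md   draft model path       -ngld  GPU layers for the draft model
    """
    return [
        binary,
        "-m", model_path,               # main model
        "-ngl", str(main_gpu_layers),   # large value = offload all layers to GPU
        "-md", model_path,              # draft model: the same GGUF file
        "-ngld", str(draft_gpu_layers), # 0 = keep the draft on the CPU
    ]


if __name__ == "__main__":
    # Hypothetical path; replace with your own GGUF file.
    cmd = build_server_command("models/example.gguf")
    # Print the command for inspection rather than running it.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; pass the list to subprocess.run only after checking the flags against your build.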
Journey Context:
Most people assume speculative decoding requires a separate tiny model (for example, a 1B model drafting for a 70B model). That adds memory overhead and complicates deployment. The hard-won insight is that llama.cpp's server can load the same GGUF twice with different GPU layer counts. The draft runs on CPU cores (fast enough for small batches) while the main model saturates the GPU. The reasoning is that drafting needs only about 20% of the full model's compute to reach roughly a 2x speedup, and the CPU can supply that while the GPU runs the main model. Alternatives such as a separate Q4_0 1B draft model give slightly better acceptance rates, but at the cost of managing a second context and of memory fragmentation.
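The speedup reasoning can be checked with the standard expected-speedup estimate for speculative decoding. This is a sketch under stated assumptions, not a measurement: alpha (per-token acceptance rate), gamma (tokens drafted per step), and cost_ratio (time for one draft token divided by time for one main-model token) are illustrative names and values that do not come from this report. The cost ratio in particular has to be measured on your own hardware.

```python
def expected_speedup(alpha, gamma, cost_ratio):
    """Estimate speculative-decoding speedup over plain decoding.

    alpha:      probability that a drafted token is accepted (0 <= alpha < 1)
    gamma:      number of tokens drafted per verification step
    cost_ratio: draft time per token / main-model time per token

    Expected tokens produced per step: (1 - alpha**(gamma + 1)) / (1 - alpha)
    Cost of one step in main-model units: gamma * cost_ratio + 1
    """
    if not 0 <= alpha < 1:
        raise ValueError("alpha must be in [0, 1)")
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    step_cost = gamma * cost_ratio + 1
    return tokens_per_step / step_cost


if __name__ == "__main__":
    # Illustrative values only: a draft costing 20% of the main model.
    print(round(expected_speedup(alpha=0.9, gamma=4, cost_ratio=0.2), 2))
    # Same acceptance rate, but a draft as slow as the main model.
    print(round(expected_speedup(alpha=0.9, gamma=4, cost_ratio=1.0), 2))
```

A result below 1.0 means drafting costs more than it saves, so it is worth timing the CPU draft against the GPU main model before relying on the 2x figure.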
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:08:22.792394+00:00— report_created — created