Report #57858
[tooling] Speculative decoding in llama.cpp uses VRAM for both main and draft model, limiting acceleration or causing OOM
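Load the small draft model (e.g., 1B or 7B) on CPU by setting its GPU layer count to 0 with -ngld 0 (-ngl applies only to the main model), while keeping the main large model (70B) on GPU with full layers via -ngl. Pass the draft model with -md draft.gguf (long form: --model-draft) and set the number of draft tokens with --draft 5. Flag names have changed between llama.cpp versions, so confirm them with --help for your build. This uses idle CPU cores for draft generation while the GPU focuses on verification, saving VRAM for the main model and enabling a speedup on single-GPU systems.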
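The flag combination above can be sketched as a small Python helper that only assembles the command line and does not run it. The binary name llama-speculative, the file names main-70b.gguf and draft.gguf, and the value 99 as a stand-in for "all layers" are assumptions for illustration; flag spellings may differ in your llama.cpp build.

```python
import shlex


def build_speculative_cmd(main_model, draft_model, draft_tokens=5,
                          main_gpu_layers=99, binary="llama-speculative"):
    """Return an argv list: main model on GPU, draft model on CPU.

    Assumptions (not from the report): the binary name, and 99 as a
    layer count large enough to offload every layer of the main model.
    """
    return [
        binary,
        "-m", main_model,                # main (large) model
        "-ngl", str(main_gpu_layers),    # GPU layers for the main model
        "-md", draft_model,              # draft (small) model
        "-ngld", "0",                    # 0 GPU layers: draft stays on CPU
        "--draft", str(draft_tokens),    # tokens drafted per verification
    ]


# Hypothetical file names, for illustration only.
cmd = build_speculative_cmd("main-70b.gguf", "draft.gguf")
print(shlex.join(cmd))
```

Building the argument list separately makes it easy to sweep --draft values from a script before committing to one.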
Journey Context:
Standard speculative decoding assumes both models fit in GPU VRAM, which is impossible for 70B+ models on consumer cards (24GB). Most users either abandon speculative decoding for large models or try to split both models across GPU and CPU, which is slow. The insight is to explicitly keep only the tiny draft model on CPU using llama.cpp's per-model layer offloading (-ngld 0 for the draft model; -ngl controls the main model). The CPU generates 5-10 draft tokens, and the GPU then verifies them in one forward pass. This requires the -md (--model-draft) parameter and careful tuning of --draft (the number of draft tokens) based on acceptance rates: typically 3-5 for diverse tasks, 8+ for repetitive code. Without this CPU-offload trick, speculative decoding is unusable for 70B models on 24GB VRAM.
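To see why the best --draft value depends on the acceptance rate, here is a rough Python sketch. It assumes each draft token is accepted independently with the same probability, which gives the standard expected-tokens-per-pass formula (1 - a^(k+1)) / (1 - a), and it assumes CPU drafting and GPU verification run one after the other. The function names and the draft_cost_ratio parameter are hypothetical, not part of llama.cpp.

```python
def expected_tokens_per_step(accept_rate, draft_tokens):
    """Expected tokens produced per verification pass.

    Assumes each draft token is accepted independently with
    probability accept_rate (a simplification).
    """
    if accept_rate >= 1.0:
        return draft_tokens + 1.0
    return (1.0 - accept_rate ** (draft_tokens + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate, draft_tokens, draft_cost_ratio):
    """Speedup over plain decoding.

    draft_cost_ratio: time for one CPU draft token divided by the time
    for one GPU forward pass of the main model (an assumed, measured input).
    """
    step_cost = draft_tokens * draft_cost_ratio + 1.0
    return expected_tokens_per_step(accept_rate, draft_tokens) / step_cost


def best_draft_length(accept_rate, draft_cost_ratio, max_tokens=16):
    """Draft length in 1..max_tokens with the highest estimated speedup."""
    return max(
        range(1, max_tokens + 1),
        key=lambda k: estimated_speedup(accept_rate, k, draft_cost_ratio),
    )


# Illustrative inputs only; measure your own acceptance rate and timings.
for rate in (0.6, 0.9):
    print(rate, best_draft_length(rate, draft_cost_ratio=0.1))
```

Under this model a low acceptance rate favors short drafts and a high one favors long drafts, in line with the ranges quoted above. Real acceptance rates vary per token, so treat the result as a starting point for tuning.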
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:36:17.328594+00:00— report_created — created