Report #82376
[tooling] High latency per token in llama.cpp; how to use speculative decoding with limited VRAM?
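Run the main model on the GPU with `-ngl 999` and load a tiny draft model (e.g., 1B-7B) on the CPU via `--draft` and `--model-draft` (llama.cpp spells the draft-model flag `--model-draft`/`-md`, not `--draft-model`; `-ngld 0` keeps the draft model's layers off the GPU). The draft model lives in abundant CPU RAM and generates candidate tokens, which the main GPU model verifies in parallel. Reported speedup is 1.5-2x, without splitting the main model across devices.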
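The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration assuming greedy decoding. It is not llama.cpp's implementation, and every name in it (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`) is hypothetical. The toy models are plain functions that map a token context to the next token.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (hypothetical interface)


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding; output equals target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each proposed position. A real runtime scores
        #    all k positions in one batched pass; this sketch uses a loop.
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3) The target always contributes one token: the correction at the
        #    first mismatch, or a bonus token if every proposal was accepted.
        out.append(target(out))
    return out[:goal]


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the large model (assumption: deterministic toy rule)."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the small model: agrees with the target most of the time."""
    tok = toy_target(ctx)
    return tok if ctx[-1] % 4 else (tok + 1) % 11


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [2], n_new=20, k=4))
```

Because a drafted token is accepted only when the target would have produced it anyway, the draft model affects speed but not the output of greedy decoding. A poor draft model costs time and leaves the text unchanged.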
Journey Context:
Speculative decoding usually assumes the draft and main models both sit on the GPU, so both need VRAM. On a 24GB consumer card, a 70B main model plus a draft model does not fit. The insight is to exploit the memory hierarchy: the draft model is tiny (1B-3B) and runs fast enough on a modern CPU (AVX-512/AMX), while the main model saturates the GPU. Common mistakes: trying to fit both models on the GPU and running out of memory, or using `--split-mode row`, which hurts latency. This workflow decouples the compute: the CPU drafts continuously and the GPU verifies in batches. Tradeoffs: higher CPU power draw and slightly more complex model management.
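Whether a CPU-side draft model pays off depends on how often its tokens are accepted and how slow it is relative to one pass of the main model. The sketch below uses the standard expected-length formula for speculative decoding under simplifying assumptions that are not from this report: each drafted token is accepted independently with probability `alpha`, drafting and verification run one after the other, and verifying a batch of `k` drafted tokens costs about one main-model pass. The function names and the sample numbers are illustrative only.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost is the time for one draft token divided by the time for one
    main-model pass (assumed value, measure it on your own hardware).
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


if __name__ == "__main__":
    # Illustrative inputs, not measurements.
    for cost in (0.05, 0.2, 0.5):
        print(cost, round(estimated_speedup(alpha=0.8, k=4, draft_cost=cost), 2))
```

Under these assumptions the estimate drops below 1 when the draft model is slow or rarely accepted, which is why the draft model needs to be small enough to run quickly on the CPU.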
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:51:30.235660+00:00— report_created — created