Report #14171
[tooling] High latency with llama.cpp on large models despite GPU usage
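Run the main 70B+ model on the GPU and load a tiny 1B draft model into CPU RAM, so that llama.cpp's speculative decoding can generate roughly 2-3x faster. Pass the draft model with --model-draft (-md) and the number of drafted tokens with --draft, for example: -md 1B_model.gguf -ngld 0 --draft 12. Here -ngld 0 offloads no draft-model layers to the GPU. Note that --draft takes a token count, not a model path.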
Journey Context:
Most users run everything either on the GPU (limited by VRAM) or on the CPU (slow). The insight is asymmetric speculative decoding: the draft model lives in system RAM and the main model in VRAM, so the draft model can keep generating candidate tokens without competing for GPU memory bandwidth. Two common mistakes are trying to fit both models on the GPU, which causes out-of-memory (OOM) errors, and giving the draft model the same context size as the main model (it should be smaller).
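The setup described above could be assembled roughly as follows. This is a minimal sketch, not a verified workaround: the helper name `build_speculative_cmd`, the file name `70B_model.gguf`, the `llama-server` binary, and the default values (99 GPU layers, a 2048-token draft context) are all illustrative assumptions. Flag spellings have changed between llama.cpp releases, so check `--help` on your build before running anything.

```python
import shlex


def build_speculative_cmd(
    main_model: str,
    draft_model: str,
    draft_tokens: int = 12,        # value taken from the report
    main_gpu_layers: int = 99,     # assumption: large enough to offload every layer
    draft_ctx: int = 2048,         # assumption: smaller than the main context
    binary: str = "llama-server",  # assumption: llama-speculative is another option
) -> list[str]:
    """Build an argument list for asymmetric speculative decoding.

    The target model is offloaded to the GPU; the draft model stays in
    system RAM because zero of its layers are offloaded.
    """
    return [
        binary,
        "-m", main_model,               # target (main) model
        "-ngl", str(main_gpu_layers),   # GPU layers for the target model
        "-md", draft_model,             # draft model path (--model-draft)
        "-ngld", "0",                   # keep all draft layers on the CPU
        "-cd", str(draft_ctx),          # draft context size (--ctx-size-draft)
        "--draft", str(draft_tokens),   # number of tokens to draft per step
    ]


if __name__ == "__main__":
    cmd = build_speculative_cmd("70B_model.gguf", "1B_model.gguf")
    # Print a copy-pasteable command line instead of executing it.
    print(shlex.join(cmd))
```

The sketch only prints the command so it can be reviewed first. Speculative decoding also expects the draft and main models to share a compatible vocabulary, so a draft model from the same family is the usual choice.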
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:49:14.552381+00:00— report_created — created