Report #50958
[tooling] llama.cpp slow token generation on large models despite full GPU offload
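Use speculative decoding with a small draft model. In llama.cpp this is available through the llama-speculative example and llama-server rather than the plain main binary: pass the draft model with -md/--model-draft and bound the number of drafted tokens per step with --draft-max (a draft length of roughly 4 to 8 tokens is a reasonable starting point). Flag names have changed between llama.cpp versions, so confirm them with --help for your build. Reported speedup is 1.5-2x; the draft model can run on CPU (for example with -ngld 0) to avoid VRAM contention with the target model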
Journey Context:
Speculative decoding generates candidate tokens with a fast, small draft model (often 7B or smaller); the large target model then verifies them in a single batched pass and keeps the prefix it agrees with. Common mistakes include using the same large model as the draft (drafting then costs as much as normal decoding, so there is no speedup) and failing to tune the number of drafted tokens per step (too high wastes compute on tokens that get rejected; too low leaves little for the target to verify in parallel, which limits the speedup). The draft model can reside in system RAM and run on the CPU while the target uses the GPU, making this viable even on single-GPU setups with limited VRAM. This is distinct from lookahead decoding, which builds candidates from n-grams rather than from a draft model.
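The draft-then-verify loop described above can be illustrated with a toy sketch. This is not llama.cpp code: target_model, draft_model, and speculative_generate are hypothetical names, the "models" are small deterministic functions standing in for greedy next-token prediction, and only target passes are counted (the cost of drafting is ignored). It assumes greedy decoding, where a drafted token is accepted only if it matches the target's own choice.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (sum(ctx) * 7 + 3) % 10


def draft_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.

    Agrees with the target except when the context length is a multiple of 4.
    """
    t = target_model(ctx)
    return t if len(ctx) % 4 else (t + 1) % 10


def plain_generate(prompt: List[int], n_new: int, target: Model) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_generate(
    prompt: List[int], n_new: int, n_draft: int, draft: Model, target: Model
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes n_draft tokens autoregressively.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every drafted position. A real engine does this
        #    in one batched forward pass, so it is counted as a single pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(n_draft):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the same pass yields one extra token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    out, passes = speculative_generate(prompt, 20, 4, draft_model, target_model)
    print("identical to plain decoding:", out == plain_generate(prompt, 20, target_model))
    print("target passes:", passes, "for 20 new tokens")
```

Because every kept token is one the target would have chosen itself, the output matches plain greedy decoding; the gain comes only from needing fewer target passes. In a real setup the draft model's own cost also counts, which is why a draft as large as the target brings no benefit.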
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:00:56.458415+00:00— report_created — created