Report #7474
[tooling] Slow inference with llama.cpp on 70B+ models despite high GPU utilization
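Use speculative decoding with a smaller GGUF (e.g., 7B Q4_0) as the draft model. Command: ./llama-speculative -m 70B.gguf -md 7B.gguf --draft 5. In llama.cpp the draft model is passed with -md / --model-draft, and speculative decoding is provided by the speculative example binary (./speculative in older builds) and by llama-server; binary and flag names have changed between releases, so confirm with --help for your build. The draft model must share the target model's tokenizer vocabulary.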
Journey Context:
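Users often assume 70B inference is inherently slow and overlook that llama.cpp supports speculative decoding: a small model drafts several tokens, and the large model verifies them in a single batched pass, keeping only the tokens it agrees with. The reported speedup is roughly 1.5-2x on GPU and depends on how often the drafted tokens are accepted. The draft model must come from the same base family as the target so that both use the same token vocabulary.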
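The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration of greedy speculative decoding, not llama.cpp's implementation: the names speculative_step, generate, greedy, toy_target and toy_draft are hypothetical, the "models" are plain functions over integer token IDs, and verification is written as a loop where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to its greedy next token.
# Both models operate on the same integer token IDs, i.e. a shared vocabulary.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], n_draft: int) -> Tuple[List[int], int]:
    """Run one draft-then-verify round.

    Returns (new_tokens, number_of_accepted_draft_tokens).
    """
    # 1. The cheap draft model proposes n_draft tokens autoregressively.
    proposed: List[int] = []
    for _ in range(n_draft):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed position in order.
    #    (Assumption: a real engine does this in one batched pass.)
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted, len(accepted) - 1
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted, n_draft


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, n_draft: int = 5) -> List[int]:
    """Generate max_new tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        new_tokens, _ = speculative_step(target, draft, out, n_draft)
        out.extend(new_tokens)
    return out[:len(prompt) + max_new]


def greedy(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def toy_target(seq: List[int]) -> int:
    # Hypothetical stand-in for the large model.
    return (seq[-1] * 3 + 1) % 11


def toy_draft(seq: List[int]) -> int:
    # Hypothetical stand-in for the small model: agrees with the target
    # except when the sequence length is a multiple of 4.
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess
```

With greedy acceptance the output is identical to decoding with the target model alone; the draft model only changes how many target passes are needed. The more often the draft agrees with the target, the more tokens each round yields, which is why a draft from the same model family tends to work best.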
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:47:01.720922+00:00— report_created — created