Report #24563
[tooling] Local 70B model token generation bandwidth-bound at 15 tok/s despite fast GPU
Run llama.cpp with speculative decoding enabled: --draft 4 -md ./q4_0-1b.gguf (upstream llama.cpp names the draft-model flag -md / --model-draft; flag names have shifted between releases, so confirm with --help on your build). The draft model should be a fast, small quant from the same family as the target (e.g., Llama-3.1-8B-Instruct-Q4_K_M for Llama-3.1-70B). The small model drafts 4 tokens, which the large model verifies in one batch, typically yielding a 1.8x-2.2x speedup on memory-bandwidth-bound consumer GPUs.
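As a rough sanity check on that speedup range, the sketch below uses a common simplified model of speculative decoding: each drafted token is accepted independently with a fixed probability, and one draft pass costs a fixed fraction of one target pass. The function name, the acceptance rates, and the 0.1 cost ratio are illustrative assumptions, not measurements from this report.

```python
def expected_speedup(accept_rate: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Estimate speculative-decoding speedup over plain decoding.

    accept_rate:      chance that each drafted token is accepted (assumed i.i.d.)
    draft_len:        tokens drafted per step (the value passed to --draft)
    draft_cost_ratio: time of one draft pass / time of one target pass
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        # Every draft token accepted, plus the target's own next token.
        tokens_per_step = float(draft_len + 1)
    else:
        # Geometric series: 1 + a + a^2 + ... + a^K
        tokens_per_step = (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)
    # One target pass plus draft_len draft passes, in units of target passes.
    cost_per_step = 1 + draft_len * draft_cost_ratio
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    for rate in (0.6, 0.7, 0.8):
        print(f"accept_rate={rate}: ~{expected_speedup(rate, 4, 0.1):.2f}x")
```

Under this model, a low acceptance rate or an expensive draft model pushes the estimate below 1.0, meaning speculative decoding would be slower than decoding with the target alone.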
Journey Context:
Large models on single consumer GPUs are bottlenecked by memory bandwidth, not compute. Each forward pass streams all weights from VRAM, so faster Tensor Cores do not help once the GPU is already at 100% memory bandwidth utilization.

Speculative decoding (related to blockwise parallel decoding) breaks the sequential dependency. A small 'draft' model, cheap to run because it reads far fewer weights per token, generates K candidate tokens autoregressively. The large 'target' model then verifies all K in a single batched forward pass over the drafted positions and keeps the longest prefix that matches what it would have produced itself. When draft and target are similar, most drafted tokens are accepted, so one step yields up to K tokens for roughly 1 large pass + K small passes. llama.cpp implements this with -md / --model-draft (the draft model) and --draft (the number of drafted tokens).

Critical requirements: the draft model must share the target's vocabulary and architecture (e.g., Llama 3.1 8B for 70B), else the acceptance rate drops to near zero. The draft model must also be small enough (1B-8B) that it does not become bandwidth-bound itself. Two common mistakes are using the same model for drafting (self-speculation), which llama.cpp does not support efficiently, and pairing mismatched architectures, such as Qwen drafting for Llama.

The speedup is only realized if the target model is bandwidth-bound. If it is compute-bound (a small quant on a very large GPU), the benefit diminishes.
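The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy-decoding illustration, not llama.cpp's implementation: the Model interface, the function names, and both toy models are hypothetical, and the per-position target calls stand in for what a real engine does in one batched forward pass. Sampling-based decoding needs a rejection-sampling acceptance rule that this sketch omits.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token sequence to its greedy next token.
Model = Callable[[List[int]], int]


def plain_greedy(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches plain_greedy(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. Draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Target checks each drafted position (one batched pass in a real engine).
        accepted: List[int] = []
        for guess in proposal:
            truth = target(tokens + accepted)
            accepted.append(truth)  # the target's token is always kept
            if guess != truth:
                break  # first mismatch: discard the rest of the draft
        else:
            # All k drafts matched, so the same pass also yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


# Toy models: the draft agrees with the target except after tokens divisible by 5.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 11


def toy_draft(seq: List[int]) -> int:
    return toy_target(seq) if seq[-1] % 5 else 0


if __name__ == "__main__":
    out = speculative_decode(toy_target, toy_draft, [1], n_new=12)
    print(out == plain_greedy(toy_target, [1], n_new=12))
```

Because the target's own token is kept at every position, a poor draft model costs speed but never changes the output under greedy decoding.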
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:38:27.232511+00:00— report_created — created