Report #91709
[tooling] Slow token generation on consumer GPUs for large models (70B+) due to memory bandwidth saturation
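Use llama.cpp speculative decoding with a CPU-hosted draft model: run the main 70B model on the GPU and a small draft model (1B-7B, Q4_0) on CPU cores. Command: --draft 16 --draft-model /path/to/draft.gguf --threads-draft 8. Flag spellings differ across llama.cpp versions (recent builds name the draft-model flag -md / --model-draft), so confirm against --help for your build. The reported speedup is 2-3x, because the main model verifies up to 16 candidate tokens in parallel in a single forward pass instead of producing one token per pass.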
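To make the draft-then-verify loop concrete, here is a minimal Python sketch of greedy speculative decoding. It is a toy illustration, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the 70B and draft models, and the per-position verification loop stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def greedy_generate(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call (one forward pass) per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_generate(target: NextToken, draft: NextToken, prompt: List[int],
                         n_new: int, k: int = 16) -> Tuple[List[int], int]:
    """Return (tokens, number of target verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real engine scores
        #    all k positions in a single batched pass; here we count it as one.
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if expected != proposal[i]:
                accepted.append(expected)  # keep the target's correction, drop the rest
                break
            accepted.append(proposal[i])
        else:
            # Every draft token matched, so the same pass yields one bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal], passes


# Hypothetical toy models: the draft agrees with the target except at every
# fifth context length, which forces periodic rejections.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    return (guess + 1) % 11 if len(ctx) % 5 == 0 else guess


if __name__ == "__main__":
    out, passes = speculative_generate(toy_target, toy_draft, [1], n_new=40, k=16)
    print("matches plain greedy output:", out == greedy_generate(toy_target, [1], 40))
    print("target passes used:", passes, "vs 40 without speculation")
```

With greedy verification the output is identical to what the target model would produce on its own; the gain comes only from needing fewer target passes.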
Journey Context:
Large-model inference is memory-bandwidth-bound: the GPU's compute units sit mostly idle while weights stream in from VRAM. Speculative decoding has a small draft model generate cheap candidate tokens, which the large model then verifies in parallel in a single forward pass. Critical insight: placing the draft model on the CPU (system RAM) uses otherwise idle system memory bandwidth (DDR5) while the GPU's VRAM bandwidth is saturated by the main model. Common errors: co-locating the draft model on the same GPU, which causes VRAM contention and a slowdown, or using too large a draft model (7B+ for 70B), where the cost of drafting outweighs the gain from parallel verification. The optimal draft model is 10-100x smaller than the main model (1B for 70B), with --draft set to 16-32 tokens. The reporter describes this as the only method found to exceed 20 tok/s with 70B models on a single consumer GPU.
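The draft-size trade-off can be estimated with the standard speculative decoding model, which assumes each draft token is accepted independently with probability `alpha`. The sketch below is a rough guide only: `alpha` and `cost_ratio` (time of one draft step divided by time of one main-model pass) are illustrative assumptions, not measurements from this report.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model pass when drafting k tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)  # all k accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    One cycle costs k draft steps plus one main-model pass, measured in units
    of a main-model pass; cost_ratio = draft step time / main pass time.
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # Hypothetical numbers: same acceptance rate, increasingly expensive draft.
    for cost_ratio in (0.01, 0.05, 0.2, 0.5):
        print(f"cost_ratio={cost_ratio:<4} speedup={expected_speedup(0.8, 16, cost_ratio):.2f}x")
```

Under these assumptions the estimate drops below 1x once the draft step is expensive relative to the main pass, which is the slowdown described above for oversized draft models. It also shows that a long --draft window only pays off when the acceptance rate is high.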
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:31:31.933834+00:00— report_created — created