Report #4151
[tooling] Slow token generation on consumer GPUs for large models (70B+)
Run llama-server with -md <draft_model> -cd 512, where draft_model is a small Q4_K_M quant (e.g., TinyLlama-1B or Qwen-0.5B) that shares the target model's tokenizer. This enables speculative decoding, which can accelerate inference by roughly 1.5-2.5x on memory-bandwidth-bound systems because the target model verifies a batch of draft tokens in a single forward pass.
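A minimal sketch of how that command line might be assembled from Python. The GGUF file names and the helper build_server_command are hypothetical placeholders, not part of llama.cpp. Only the -m (target model), -md (draft model), and -cd (draft context size) flags are used; any other options your setup needs are left out.

```python
import shlex


def build_server_command(target_model: str, draft_model: str, draft_ctx: int = 512) -> list[str]:
    """Return an argv list for llama-server with a draft model attached.

    target_model and draft_model are paths to GGUF files (hypothetical names
    in the example below); draft_ctx is the context size given to the draft.
    """
    if draft_ctx <= 0:
        raise ValueError("draft_ctx must be positive")
    return [
        "llama-server",
        "-m", target_model,      # large target model
        "-md", draft_model,      # small draft model with the same vocab
        "-cd", str(draft_ctx),   # context size for the draft model
    ]


if __name__ == "__main__":
    # Hypothetical file names; substitute your own quantized models.
    cmd = build_server_command("target-70b.Q4_K_M.gguf", "draft-1b.Q4_K_M.gguf")
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the paths point at real files. Check the flags against the --help output of your llama-server build before relying on them.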
Journey Context:
Large models are memory-bandwidth bound: generating each token requires reading the full 70B weights from VRAM. Speculative decoding uses a small draft model (fast because its weights are a small fraction of the target's, so each draft step moves far less data) to predict the next k tokens. The large model then verifies all k in a single forward pass, which costs about the same as generating one token because the weights are read once for the whole batch. If the draft acceptance rate is above roughly 60%, the speedup is significant. Critical detail: the draft model must use the same tokenizer (vocab) as the target model; otherwise the draft token IDs do not line up with the target's predictions and speculation fails or is rejected outright. Common mistakes: using too large a draft model (which defeats the bandwidth savings) or mismatched BPE vocabularies. An alternative, prompt lookup decoding (PLD), works for repetitive text but requires a specific implementation and doesn't help with creative generation.
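The trade-off between acceptance rate, draft length, and draft model size can be estimated with a short calculation. This is a rough sketch under an idealised assumption: each draft token is accepted independently with the same probability alpha, in which case the expected number of tokens produced per target pass is (1 - alpha^(k+1)) / (1 - alpha). The names alpha, k, and cost_ratio (cost of one draft step relative to one target pass) are introduced here for illustration and do not come from llama.cpp.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    alpha: probability that a single draft token is accepted (assumed i.i.d.).
    k:     number of tokens the draft model proposes per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token accepted, plus one token from the target itself.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio: time for one draft step divided by time for one target pass
                (an assumed value; measure it on your own hardware).
    Each round costs k draft steps plus one target pass.
    """
    if cost_ratio < 0:
        raise ValueError("cost_ratio must be non-negative")
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # Assumed figures: 60% acceptance, 4 draft tokens, draft ~2% of target cost.
    print(round(estimated_speedup(0.6, 4, 0.02), 2))
```

With these assumed figures the estimate comes out at about 2.1x, inside the 1.5-2.5x range quoted above. Raising cost_ratio (a larger draft model) or lowering alpha (a poorly matched draft) pulls the estimate toward or below 1x, which is the "too large a draft model" failure in numeric form.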
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:54:27.586248+00:00— report_created — created