Report #69792
[tooling] Slow token generation on consumer hardware for large models (70B+) even with quantization
Use speculative decoding with a small, fast draft model (e.g., 1B-7B Q4_0): pass the large model via `--model` (target) and the small one via `--draft-model`, and set `--draft` to 16-24 tokens as a starting point, then tune for throughput.
Journey Context:
Standard inference on large models is memory-bandwidth bound: generating each token requires reading the full weights from RAM/VRAM. Speculative decoding uses a small 'draft' model to generate K candidate tokens autoregressively, then the large 'target' model verifies all K tokens in parallel in a single forward pass, so one read of the target weights can yield several tokens. Drafted tokens are kept up to the first one the target rejects. If the draft model has a high acceptance rate (typically >70% for related architectures), this yields a 1.5-2x speedup. The hard-won insight is that the draft model should be aggressively quantized (Q4_0 or Q3_K_S) and much smaller than the target (e.g., Llama-3-8B drafting for Llama-3-70B, or TinyLlama-1.1B for a Llama-2-family target). Draft and target must share a tokenizer vocabulary, otherwise the drafted tokens cannot be verified. A small, heavily quantized draft reads only a small fraction of the bytes per token that the target does, which is what makes it extremely fast; even a 1B model at 4-bit is several hundred MB, so the gain comes from reduced memory traffic, not from fitting in L2/L3 cache. The `--draft` parameter controls the number of tokens to draft per step (16-24 is the sweet spot reported here for 70B targets; higher values spend more time drafting and verifying tokens that are likely to be rejected, without proportional gains). This is distinct from prompt caching or batching: it accelerates autoregressive generation itself.
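The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation of any particular inference engine. The "models" are hypothetical Python callables that return a greedy next token, decoding is greedy only, and the verification loop merely emulates what a real engine computes in one batched forward pass. The `estimated_speedup` helper uses the standard independence approximation (each drafted token accepted with probability `alpha`) and a hypothetical `cost_ratio` parameter, so treat its numbers as a rough guide for choosing the draft length.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """One draft-then-verify step (greedy decoding). Returns the new tokens."""
    # 1. The draft model proposes k tokens autoregressively (cheap, sequential).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real engine scores all k positions
    #    in ONE forward pass; this loop only emulates that result.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted: the same target pass yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int) -> List[int]:
    """Generate n_new tokens after the prompt using speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per target pass, assuming each drafted token is
    accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Rough speedup over plain decoding. cost_ratio (an assumption) is the
    draft's time per token divided by the target's time per forward pass."""
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)
```

With greedy decoding, `generate` returns exactly the tokens the target would have produced on its own; only the number of target passes changes. Plugging a measured acceptance rate and cost ratio into `estimated_speedup` for several values of `k` shows the diminishing returns of long drafts, which is why the draft length is worth tuning on your own hardware.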
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:37:47.575088+00:00— report_created — created