Report #39925
[tooling] Large 70B models too slow for interactive chat (< 10 tok/sec)
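Use speculative decoding with `-md draft.gguf` (draft model path) and a draft length of about 10 tokens per step (`--draft-max 10` in recent llama.cpp builds, `--draft 10` in older ones; note that `-cd` sets the draft model's context size, not the draft length, so confirm against `--help` for your build). Pair a small, fast draft (e.g., a Q4_0 1B or 7B model) with your 70B target; this can yield a 2-3x speedup when the draft's predictions are accepted often. Both models must share the same vocabulary/tokenizer.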
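As a rough sanity check on the 2-3x figure, the sketch below estimates the speedup from three inputs. It is a back-of-the-envelope model, not a measurement: it assumes each drafted token is accepted independently with a fixed probability, and that verifying a batch of drafted tokens costs about as much as one ordinary target forward pass. The names `accept_rate`, `k`, and `draft_cost` are illustrative parameters, not llama.cpp options.

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate (a simplification), and that the target always contributes
    one token of its own: the correction at the first mismatch, or a bonus
    token when all k drafts are accepted.
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain one-token-at-a-time decoding.

    draft_cost is the time of one draft forward pass relative to one target
    forward pass (hypothetical parameter; 0.05 means the draft is 20x faster).
    One round costs k draft passes plus one target pass.
    """
    step_cost = k * draft_cost + 1.0
    return expected_tokens_per_step(accept_rate, k) / step_cost


if __name__ == "__main__":
    # Illustrative inputs only; real acceptance rates depend on the model pair
    # and on the text being generated.
    for label, cost in [("small draft", 0.05), ("mid-size draft", 0.25)]:
        speedup = estimated_speedup(accept_rate=0.8, k=5, draft_cost=cost)
        print(f"{label}: ~{speedup:.2f}x")
```

Under these assumptions, the estimate falls off quickly as the draft gets slower relative to the target, which is why the draft model's size matters so much.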
Journey Context:
When running 70B models for chat on consumer GPUs (24-48GB), token generation is memory-bandwidth bound rather than compute bound, which yields roughly 5-10 tok/sec. Speculative decoding (also called assisted generation or blockwise parallel decoding) works around this bottleneck: a smaller, faster 'draft' model predicts the next K tokens, and the large 'target' model then verifies all K in a single forward pass. Drafted tokens are accepted up to the first mismatch, so when the draft is right (which it often is for repetitive text or code) you get up to K tokens for the cost of one target forward pass plus K cheap draft forward passes. Critical requirements: (1) Both models must use the exact same tokenizer (`tokenizer.ggml.model` and vocabulary). (2) The draft must be 3-5x faster than the target to overcome the overhead (use a Q4_0 1B-7B draft for a 70B target). (3) Offload both models fully to the GPU: `-ngl 999` covers the target, and in llama.cpp the draft model's layers are set separately with `-ngld 999`. Common mistake: pairing a 13B draft with a 70B target; the draft is too slow, and the drafting and verification overhead eliminates the gains.
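The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration of the greedy variant only, not llama.cpp's implementation: the "models" are hypothetical stand-in functions that map a token sequence to the next token, and the verification loop only mimics the result of what a real engine computes in one batched forward pass. Sampling-based decoding uses a probabilistic acceptance rule that is not shown here.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: token sequence -> next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The draft proposes k tokens autoregressively (k cheap passes).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real engine scores all positions
    #    in one batched pass; this loop only reproduces the outcome.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. All k drafts matched, so the same pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):][:n_new]


def greedy(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out[len(prompt):]


def toy_target(seq: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy draft: agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12, k=3))
    print(greedy(toy_target, [0], 12))
```

In the greedy case the output is identical to decoding with the target alone; the draft only changes how many target passes are needed, which is why a mismatched or slow draft costs speed rather than quality.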
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:29:16.589186+00:00— report_created — created