Report #84745
[tooling] 70B model inference too slow for real-time use despite GPU acceleration
Enable speculative decoding with a tiny 1B-3B draft model using -md draft.gguf --draft 16 --draft-min 4, for a reported 2-3x speedup with minimal VRAM overhead. The actual gain depends on how often the target model accepts the drafted tokens.
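The flags above can be assembled into a full command line. The following is a minimal sketch, not a verified launch script: the binary name llama-server and the target model filename are assumptions (the report names neither), and build_command is a hypothetical helper. Only -md, --draft and --draft-min come from the report; -m is the usual llama.cpp flag for the main model.

```python
import shlex


def build_command(target_model, draft_model, draft_max=16, draft_min=4,
                  binary="llama-server"):
    """Return an argv list that enables speculative decoding.

    binary and target_model are placeholders chosen for illustration;
    substitute whatever your installation actually uses.
    """
    if draft_min > draft_max:
        raise ValueError("draft_min must not exceed draft_max")
    return [
        binary,
        "-m", target_model,             # large target model (assumed filename)
        "-md", draft_model,             # small draft model, as in the report
        "--draft", str(draft_max),      # upper bound on drafted tokens per step
        "--draft-min", str(draft_min),  # lower bound on drafted tokens
    ]


cmd = build_command("target-70b.gguf", "draft.gguf")
# Print a copy-pasteable shell line; nothing is executed here.
print(shlex.join(cmd))
```

Building the argv as a list keeps quoting correct if a model path contains spaces. Check the flags against the --help output of your build before running, since option names can differ between versions.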
Journey Context:
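Most users assume a draft model must be close in size to the target (e.g., a 7B model drafting for a 70B), but tiny 1B models work surprisingly well: they correctly predict 'easy' tokens (common words, punctuation), so the 70B model can verify a whole run of drafted tokens in one batched pass and only has to supply a token itself where the draft gets a 'hard' one wrong. The -md flag specifies the draft model, --draft sets the maximum number of tokens drafted per step (usually 8-16), and --draft-min sets the minimum number of drafted tokens required before a draft is used. Note that --draft-min is a length threshold, not a confidence threshold; in llama.cpp the confidence cutoff is a separate option, --draft-p-min. The tradeoff is the VRAM needed to hold both models, though a 1B model is negligible next to a 70B. Common mistakes: drafting too many tokens (24+ gives diminishing returns, because everything after the first rejected token is discarded) and using a draft model whose tokenizer does not match the target's.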
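Why a tiny draft model helps, and why long draft chains stop paying off, can be illustrated with the standard idealized model of speculative decoding. This is a rough sketch under strong assumptions: each drafted token is accepted independently with the same probability alpha, one draft-model pass costs a fixed fraction cost_ratio of one target-model pass, and batched verification costs the same as a single target pass. The values 0.8 and 0.02 below are illustrative guesses, not measurements from this report.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target-model pass.

    alpha: assumed per-token acceptance probability (0 <= alpha <= 1)
    gamma: number of tokens drafted per step
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus one bonus token
    # Geometric series 1 + alpha + ... + alpha**gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Idealized speedup over plain decoding.

    One speculative step costs gamma draft passes plus one target pass,
    measured in units of a single target pass.
    """
    step_cost = gamma * cost_ratio + 1
    return expected_tokens_per_pass(alpha, gamma) / step_cost


# Illustrative sweep over draft lengths; alpha and cost_ratio are assumptions.
for gamma in (4, 8, 16, 24, 32):
    print(gamma, round(estimated_speedup(0.8, gamma, 0.02), 2))
```

In this model the expected number of accepted tokens saturates as gamma grows while the drafting cost keeps rising linearly, which is consistent with the report's advice to stay around 8-16 draft tokens. Real acceptance rates vary by prompt and model pair, so treat the output as a shape, not a prediction.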
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:50:05.403663+00:00— report_created — created