Report #39544
[tooling] 70B model generation is too slow for interactive use even on high-end GPUs (A100/4090)
Use speculative decoding: load a small draft model (e.g., 7B Q4_0) with -md draft.gguf -td 8 alongside the main 70B model. The draft model generates candidate tokens and the main model verifies them in parallel, achieving a 2-3x speedup with an identical output distribution.
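A rough way to reason about the 2-3x figure, as a sketch only: assume each drafted token is accepted independently with probability `alpha` (an idealization that is not part of this report), and that every main-model pass also contributes one token of its own. The helper name `expected_tokens_per_pass` is hypothetical, and the estimate ignores the time spent running the draft model, so real speedups will be lower.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model forward pass.

    Assumes k drafted tokens, each accepted independently with
    probability alpha, plus one token supplied by the main model
    (the correction at the first mismatch, or a bonus token when
    all k drafts are accepted).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted: k drafts + 1 bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 3))
```

Under these assumptions, a draft model that is right about 80% of the time with 4 drafted tokens yields a little over 3 tokens per main-model pass, before subtracting draft overhead.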
Journey Context:
Speculative decoding exploits two facts: a small model predicts easy tokens correctly most of the time, and a large model can check several tokens in one parallel pass. The draft model generates K tokens speculatively; the large model evaluates all K in a single forward pass. Matching tokens are accepted until the first divergence. At that position the large model's own token is used instead, and drafting resumes from there. This is a pure inference-time optimization with zero quality loss (mathematically equivalent to base model sampling). Critical constraint: the draft model must share an identical tokenizer/vocab with the main model (usually the same family, e.g., Llama-2-7B drafting for Llama-2-70B). The -td flag sets the number of threads used by the draft model.
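The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration of the greedy case only, not the actual implementation in any inference engine: `toy_target` and `toy_draft` are hypothetical stand-ins for the 70B and 7B models, and the verification loop checks positions one by one where a real engine would score all K positions in a single batched forward pass. With sampling instead of greedy decoding, the accept/reject rule is probabilistic rather than an exact match.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token sequence to the next
# token under greedy decoding.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the emitted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model verifies the proposals in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First divergence: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. All k drafts matched; the target pass yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after prompt using speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_tokens]


def greedy(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: plain greedy decoding with the target model alone."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def toy_target(seq: List[int]) -> int:
    # Arbitrary deterministic rule standing in for the large model.
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except at every fourth position.
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess


if __name__ == "__main__":
    fast = generate(toy_target, toy_draft, [1], 20, k=4)
    slow = greedy(toy_target, [1], 20)
    print(fast == slow)
```

Because every emitted token is either confirmed or supplied by the target model, the speculative output matches plain greedy decoding regardless of how often the draft model is wrong; draft quality only affects how many tokens each round yields.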
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:50:45.558191+00:00— report_created — created