Report #57353
[tooling] Speculative decoding with 1B draft model shows no speedup, or even higher latency, vs base 70B model
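Use the same 70B base model quantized to Q2_K or Q3_K as the draft model instead of a separate small model; pass the alternative GGUF as the draft model (in llama.cpp that is -md / --model-draft; --draft sets the number of drafted tokens, not the model path)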
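A minimal sketch of how the invocation could be assembled, assuming the tool is llama.cpp, where the target model is passed with -m and the draft model with -md. The binary name llama-speculative, the helper build_speculative_cmd, and both GGUF file names are placeholders for illustration, not taken from the report.

```python
import shlex
from typing import List


def build_speculative_cmd(
    target_gguf: str,
    draft_gguf: str,
    prompt: str,
    binary: str = "llama-speculative",  # assumed binary name; adjust to your build
) -> List[str]:
    """Build an argv list that runs the target model with a draft model.

    -m  : path to the target (verifying) model
    -md : path to the draft model
    -p  : prompt text
    """
    return [binary, "-m", target_gguf, "-md", draft_gguf, "-p", prompt]


# Hypothetical file names: the same 70B base model at two quantization levels.
cmd = build_speculative_cmd(
    target_gguf="llama-70b.Q4_K_M.gguf",
    draft_gguf="llama-70b.Q2_K.gguf",
    prompt="Hello",
)
print(shlex.join(cmd))  # shell-quoted command line, ready to inspect before running
```

Printing the command rather than executing it keeps the sketch safe to run; pass the list to subprocess.run only after checking the flags against the installed version.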
Journey Context:
Conventional wisdom says a tiny model (1B-3B) should draft for a large one, but here the small draft introduces architecture mismatch and context-switching overhead that negate the gains. The insight: draft with the identical architecture under aggressive quantization (Q2_K). Reported benefits: an identical KV cache layout that avoids copy overhead; higher-quality drafts than the 1B model, so more drafted tokens are accepted; no context switching. Tradeoff: increased VRAM, since two copies of the 70B model are resident (one Q4, one Q2). Reported speedups are 1.5-2x, versus 0.8x (a slowdown) with the 1B draft. The pattern is underutilized because loading the same model twice seems counterintuitive.
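The tradeoff described above can be estimated before committing VRAM. The sketch below uses the standard expected-speedup model for speculative decoding, assuming each drafted token is accepted independently with probability alpha. The function names and the bits-per-weight figures are illustrative assumptions, not measurements from this report.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    alpha: per-token acceptance probability (assumed independent per token).
    gamma: number of tokens drafted per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus the one token the target adds itself.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding with the target model.

    cost_ratio: time of one draft forward pass divided by the time of one
    target forward pass. One step costs gamma draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight footprint in GB (weights only, no KV cache or buffers)."""
    return params_billion * bits_per_weight / 8.0


# Illustrative inputs only: measure alpha and cost_ratio on your own setup.
speedup = estimated_speedup(alpha=0.9, gamma=4, cost_ratio=0.65)
total_gb = weights_gb(70, 4.5) + weights_gb(70, 3.0)  # assumed Q4 and Q2 bits per weight
print(f"estimated speedup: {speedup:.2f}x, weights resident: {total_gb:.1f} GB")
```

The model makes the two opposing effects explicit: a same-architecture draft should raise alpha, but a quantized 70B draft also has a much higher cost_ratio than a 1B model, so both numbers are worth measuring before choosing gamma.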
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:45:06.542097+00:00— report_created — created