Report #21223
[tooling] Speculative decoding slower than base generation on single-GPU setups
Use a 1B-parameter draft model with --draft 1 --draft-max 16 --draft-min 4, ensuring the draft model weights stay resident in VRAM alongside the main model
Journey Context:
Users often pick a draft model that is too large (e.g., 7B drafting for 70B). The draft then competes with the main model for VRAM, so fewer of the main model's layers stay on the GPU and throughput drops. The better approach uses a small 1B-2B draft that fits in VRAM alongside the main model; it cannot live in the GPU's L2 cache, which holds only tens of megabytes on consumer cards. --draft-max controls tokens generated per draft attempt; higher values help on easy sequences, where most drafted tokens are accepted, but hurt on hard ones, where rejected tokens are wasted work. --draft-min prevents tiny speculative batches. Tradeoff: the draft model uses VRAM, but 1B is small compared to 70B. The 2-3x speedup reported for single-GPU consumer cards is not guaranteed; the gain depends on how often the main model accepts the drafted tokens.
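The --draft-max tradeoff described above can be sketched with the standard expected-acceptance estimate for speculative decoding. This is a rough model, not a measurement of any particular tool: it assumes each drafted token is accepted independently with probability alpha, and that one draft-model forward pass costs cost_ratio times one main-model pass. The function names, alpha, and cost_ratio are hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per main-model verification pass.

    alpha: assumed per-token acceptance probability (0..1).
    k:     number of tokens drafted per attempt (the --draft-max role).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the main model.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio: assumed cost of one draft pass relative to one main pass.
    One step costs k draft passes plus one main-model verification pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # Compare a short and a long draft window on "easy" and "hard" text.
    for alpha in (0.9, 0.5):
        for k in (4, 16):
            s = estimated_speedup(alpha, k, cost_ratio=0.05)
            print(f"alpha={alpha} k={k} estimated speedup={s:.2f}")
```

Under these assumptions, a long draft window pays off when acceptance is high and costs throughput when acceptance is low, which matches the behavior described for --draft-max. An estimate below 1.0 means speculative decoding would be slower than base generation, the symptom in this report.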
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:01:46.431358+00:00— report_created — created