Report #75437
[tooling] Speculative decoding with a draft model is hard to tune, and the draft model takes extra VRAM
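Use llama.cpp's lookahead decoding instead of a separate draft model. In the llama.cpp tree this is implemented as the lookahead example program rather than as a documented --lookahead flag on the main CLI or server, so check --help on your build before relying on the flag. It uses the main model itself to guess several future tokens in parallel, collects the guesses as candidate n-grams, and verifies them in the same forward pass, so no draft model has to be held in memory.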
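For a rough sense of the memory a draft model adds, the weight footprint can be estimated from parameter count and bits per weight. This is a back-of-envelope sketch only: the 7B main model, the 1B draft model, and the 4.5 bits per weight are hypothetical example values, and the estimate ignores the KV cache and runtime buffers, which a draft model also needs.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


GIB = 1024 ** 3

# Hypothetical example values, not measurements.
main = weight_bytes(7e9, 4.5)   # main model
draft = weight_bytes(1e9, 4.5)  # draft model

print(f"main:  {main / GIB:.2f} GiB")
print(f"draft: {draft / GIB:.2f} GiB (+{draft / main:.0%} on top of the main model)")
```

Under these assumptions the draft model is a noticeable addition rather than a doubling; the real figure depends on which draft model and quantization are chosen.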
Journey Context:
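Standard speculative decoding loads a second, smaller 'draft' model alongside the main model. That adds VRAM on top of the main model (draft weights plus a second KV cache) and complicates deployment, since two model files with a matching vocabulary have to be managed. Lookahead decoding, a form of parallel decoding, removes this need by using the main model itself to speculate. In each step it runs a lookahead window that guesses several future tokens at once (Jacobi-style iteration), collects the guesses as candidate n-grams, verifies candidates in the same forward pass using a custom attention mask, and accepts the longest candidate prefix that matches what the model would have generated anyway. This requires no extra model, just extra compute during generation. Common confusion: thinking that the draft-model options (such as --draft) are the only way to speculate. Tradeoffs: each step costs more compute on the main model, but a step can yield several tokens instead of one. The speedup depends on model, hardware, and workload; it tends to be largest on predictable, repetitive output such as code, and figures like 2-3x should be treated as a best case to measure, not a guarantee. Because the benefit comes from otherwise idle compute, it fits single-stream or low-batch generation best and shrinks as batch size grows. Alternatives: traditional draft models (more VRAM), prompt lookup decoding (simpler, but it only helps when the output repeats text from the prompt), or Medusa (requires training extra decoding heads).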
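The acceptance rule shared by these speculative schemes can be sketched as follows, assuming greedy decoding. This is an illustration, not llama.cpp's implementation: accept_longest_prefix and toy_next_token are hypothetical names, the toy model simply counts upward, and a real implementation scores all candidate positions in one batched forward pass instead of calling the model once per token.

```python
from typing import Callable, List, Sequence


def accept_longest_prefix(
    context: Sequence[int],
    candidate: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens gained from one verification step (greedy decoding)."""
    accepted: List[int] = []
    for guess in candidate:
        expected = next_token(list(context) + accepted)
        if guess != expected:
            # First mismatch: keep the model's own token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(guess)
    # Every guess matched, so the step also yields the token after them.
    accepted.append(next_token(list(context) + accepted))
    return accepted


def toy_next_token(tokens: Sequence[int]) -> int:
    """Hypothetical stand-in for the main model: always predicts last + 1 (mod 10)."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    # Two correct guesses, then a wrong one: the step still yields three tokens.
    print(accept_longest_prefix([1, 2, 3], [4, 5, 9, 7], toy_next_token))
```

The output is always identical to what plain greedy decoding would produce; a good candidate only changes how many tokens each step delivers, and a bad candidate still yields one token.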
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:13:27.926767+00:00— report_created — created