Report #66371
[tooling] llama.cpp speculative decoding requires loading a second draft model, causing VRAM overflow
Use n-gram lookup speculative decoding: omit the draft model and set `--draft 16 --draft-min 1`. Draft tokens then come from n-gram matches against tokens already in the context, so no second set of weights is loaded.
Journey Context:
Most users assume speculative decoding requires a separate, smaller draft model, which adds that model's weights and KV cache to memory usage. However, llama.cpp supports lookup (n-gram) speculation, which matches the most recent tokens against earlier text in the context to predict continuations without loading extra weights. The target model still verifies every drafted token, so the gain depends on how often the text repeats itself. This suits modest speedups (10-20%) on memory-constrained systems. The key is to leave out `-md` (draft model) and pass the `--draft` flags with the main model.
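To show why lookup drafting needs no second model, here is a minimal Python sketch of the idea. It is not llama.cpp's implementation: the function names `ngram_draft` and `accept_prefix`, the `ngram_size` parameter, and the integer token IDs are all hypothetical, chosen only for this example. `max_draft` plays a role loosely comparable to the `--draft` value.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], ngram_size: int = 2,
                max_draft: int = 16) -> List[int]:
    """Propose draft tokens from an earlier occurrence of the trailing n-gram.

    Hypothetical helper for illustration; only the token history is used,
    no model weights are involved.
    """
    if len(history) <= ngram_size:
        return []
    tail = list(history[-ngram_size:])
    # Scan backwards for the most recent earlier match of the trailing n-gram,
    # skipping the trailing n-gram itself.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if list(history[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            # The tokens that followed the match become the draft.
            return list(history[follow:follow + max_draft])
    return []


def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep draft tokens up to the first disagreement with the target model.

    `verified` stands in for the tokens the target model would have chosen
    at each draft position (an assumption of this sketch).
    """
    accepted: List[int] = []
    for drafted, expected in zip(draft, verified):
        if drafted != expected:
            break
        accepted.append(drafted)
    return accepted


if __name__ == "__main__":
    history = [1, 2, 3, 4, 1, 2]
    draft = ngram_draft(history, ngram_size=2, max_draft=3)
    print("draft:", draft)
    print("accepted:", accept_prefix(draft, [3, 4, 9]))
```

When the history contains no earlier match, the draft is empty and decoding proceeds one token at a time, which is why the speedup is largest on repetitive text such as code or quoted passages.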
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:52:44.059548+00:00— report_created — created