Report #15151
[tooling] Speculative decoding requires keeping a separate small draft model in VRAM
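Use llama.cpp's self-speculative mode (-cd 1 --draft-max 8), which uses the target model's own early layers as the draft model and eliminates the need for a separate draft GGUF. Check both flags against --help for your build before relying on them: in llama.cpp's documented options, -cd is the short form of --ctx-size-draft (the draft model's context size) rather than a mode switch, while --draft-max sets the maximum number of tokens drafted per step.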
Journey Context:
Standard speculative decoding requires loading two models (draft + target), which often exceeds the VRAM available on consumer GPUs. With self-speculative decoding in llama.cpp, the same model generates draft tokens using a subset of its layers or an early exit, and the full model then verifies them. Drafting adds roughly 10-15% overhead, but the reported net speedup is 1.3-1.8x, without the memory cost of a second model. Common mistake: trying to load a 7B draft alongside a 70B target on 24 GB of VRAM, which does not fit. Self-speculative decoding removes the draft model's share of that memory.
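The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration of greedy speculative decoding in general, not llama.cpp's implementation: the names speculative_step, generate, toy_target and toy_draft are hypothetical, the "models" are toy functions over integer tokens, and the draft is a deliberately imperfect copy of the target standing in for an early-exit pass. A real engine would verify all drafted positions in one batched forward pass instead of one call per token.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[Token], draft_max: int) -> List[Token]:
    """Run one draft-then-verify step and return the tokens to append."""
    # Draft phase: the cheap model proposes up to draft_max tokens.
    drafted: List[Token] = []
    for _ in range(draft_max):
        drafted.append(draft_next(context + drafted))

    # Verify phase: keep drafted tokens while they match the target's choice.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[Token], n_new: int,
             draft_max: int = 8) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return the number of verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target_next, draft_next, out, draft_max))
        steps += 1
    return out[:len(prompt) + n_new], steps


def greedy(target_next: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: plain greedy decoding, one target call per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the full model.
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for an early-exit draft: wrong on every 4th position.
    tok = toy_target(ctx)
    return (tok + 1) % 11 if len(ctx) % 4 == 0 else tok
```

Calling generate(toy_target, toy_draft, [1], 20) yields the same tokens as greedy(toy_target, [1], 20) while using fewer verify steps than generated tokens. That is the property the speedup depends on: output is unchanged, and the gain grows with how often the draft agrees with the target.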
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:18:35.441248+00:00— report_created — created