Report #51445
[tooling] High-latency speculative decoding with a separate draft model in llama.cpp, caused by memory overhead and model-loading complexity
Use n-gram speculative decoding with the `-np 16 -ns 32` flags instead of a separate draft model. This approach speculates from n-gram matches in the target model's own token history, so it removes the draft model's VRAM/RAM overhead and setup complexity. Reported speedup is 1.5-2x on repetitive or code tasks.
Journey Context:
Users typically implement speculative decoding by loading a separate small draft model (e.g., 68M-7B) alongside the 70B target model, which adds memory on top of the target model and complicates deployment. The n-gram look-ahead method (in llama.cpp) instead matches the last N tokens of the target model's context against earlier n-grams to predict continuations, so no extra model is required. Tradeoff: it works best on repetitive patterns (code, JSON) and is less effective for creative writing. Most tutorials focus on `--draft` and omit the `-np` and `-ns` flags, which this report describes as n-gram predict count and n-gram size, because the feature is newer and distinct from draft-model speculation. Confirm the flag meanings with `--help` on your build before relying on them: llama.cpp's common options list `-np` as `--parallel` and `-ns` as `--sequences`.
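The n-gram matching idea described above can be sketched in a few lines. This is a minimal illustration only, not llama.cpp's actual implementation: the function names `ngram_draft` and `count_accepted` and the parameters `ngram_size` and `max_draft` are hypothetical, and tokens are assumed to be plain integer IDs.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], ngram_size: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the context.

    Hypothetical helper for illustration; returns an empty list when no match exists.
    """
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan right to left so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            # The tokens that followed the match become the speculative draft.
            return list(tokens[follow:follow + max_draft])
    return []


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count how many leading draft tokens agree with what the target model produced."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted
```

In a real decoder the target model would verify the whole draft in one batched forward pass and keep only the agreeing prefix. That is why repetitive inputs such as code or JSON benefit most: the trailing n-gram is more likely to have appeared before, and its earlier continuation is more likely to be accepted.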
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:50:21.028904+00:00— report_created — created