Report #68081
[tooling] Slow token generation on single GPU with 70B model, cannot fit draft model for speculative decoding
Use llama.cpp's native n-gram speculative decoding: add `--draft 16 --p-split 0.1 --lookup-ngram-min 2` to main/server. This drafts from previously generated tokens instead of from a separate model, for a reported 1.3-1.5x speedup.
Journey Context:
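Traditional speculative decoding requires a second 'draft' model (e.g., a 7B draft for a 70B target). The draft's weights and KV cache need VRAM on top of the target model, and a quantized 70B already fills most of a 48GB GPU. The n-gram lookup method (Dec 2023) needs no second model: it matches the last few tokens against earlier occurrences in the current context and proposes the tokens that followed as the draft, so the extra memory is negligible. Tradeoff: it only works well for repetitive output (code, JSON), not creative writing. The flags are cryptic: `--draft` sets the number of tokens to draft, `--p-split` is listed in llama.cpp's help as the speculative-decoding split probability (it governs when the draft branches, rather than acting as a plain acceptance threshold), and `--lookup-ngram-min` sets the minimum n-gram size (usually 2 or 3). Flag names and defaults have changed between llama.cpp versions, so check `--help` on your build. Most users don't know this exists because basic tutorials don't mention it.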
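To make the lookup idea concrete, here is a minimal Python sketch of drafting from context. It is an illustration only, not llama.cpp's implementation: the function name `draft_from_context` and its parameters `ngram_min`, `ngram_max`, and `n_draft` are hypothetical and only loosely mirror the flags above.

```python
from typing import List, Sequence


def draft_from_context(tokens: Sequence[int], ngram_min: int = 2,
                       ngram_max: int = 4, n_draft: int = 16) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in `tokens`.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    # Try longer n-grams first: a longer match is a stronger hint.
    for n in range(ngram_max, ngram_min - 1, -1):
        if len(tokens) <= n:
            continue  # context too short to hold the tail plus a match
        tail = list(tokens[-n:])
        # Scan backwards from the most recent candidate position,
        # excluding the tail's own position at the end of the context.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == tail:
                # The tokens that followed the earlier occurrence
                # become the draft, capped at n_draft.
                return list(tokens[start + n:start + n + n_draft])
    return []  # no match: decode the next token normally


# The context ends in (1, 2), which also appeared at the start,
# so the draft is whatever followed that earlier occurrence.
print(draft_from_context([1, 2, 3, 4, 5, 1, 2], n_draft=3))
```

The target model then checks the draft in a single batched pass and keeps only the prefix it agrees with. That is why the method helps on repetitive text and adds little on novel text, where drafts rarely match.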
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:45:25.539948+00:00— report_created — created