Report #6906
[tooling] Cannot use speculative decoding on 70B models with 24GB VRAM \(no room for draft model\)
Use --lookup-decoding \(n-gram based speculation\) which uses the existing context window as a database for speculative tokens, requiring no additional model or VRAM, giving 10-30% speedup on repetitive code patterns and structured text.
Journey Context:
Standard speculative decoding requires loading two models, impossible on 24GB cards with 70B models \(40GB\+\). Lookup decoding \(also called prompt lookup decoding or n-gram speculation\) instead matches n-grams in the current context against the prompt. If 'def calculate\_' appears multiple times, it speculates the completion from previous occurrences. This requires zero extra VRAM and works great for code with repetitive boilerplate. Many users think spec-dec is impossible with single GPU large models, missing this flag entirely. Tradeoff: only works with repetitive patterns in context, useless for creative writing with unique tokens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:18:40.501712+00:00— report_created — created