Report #96902
[tooling] Slow inference on Apple Silicon with speculative decoding due to tokenizer mismatch or memory overhead
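Use a lower-bit quantization of the SAME model as the draft model (e.g., main=Q4_K_M, draft=Q2_K) instead of a separate small model like TinyLlama. Ensure both use an identical vocabulary. Load the draft alongside the main model with --draft, and use -ngl 999 so that all layers of both models are offloaded to the GPU. Flag names vary by tool and version: if this is llama.cpp, as the quant names suggest, recent builds pass the draft model with -md/--model-draft and set its GPU layers with -ngld, so check --help for your build.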
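A minimal sketch of the "identical vocabulary" check, assuming you have already exported each model's token list (one string per token id) by whatever means your tooling provides. The function name `vocab_mismatches` and the `limit` parameter are hypothetical and not part of any library.

```python
from typing import List, Sequence


def vocab_mismatches(
    main_vocab: Sequence[str],
    draft_vocab: Sequence[str],
    limit: int = 5,
) -> List[str]:
    """Return readable differences between two token lists.

    An empty list means both models map every token id to the same
    string, which is what the draft/main pairing needs.
    """
    problems: List[str] = []

    # Different sizes already rule out an identical vocabulary.
    if len(main_vocab) != len(draft_vocab):
        problems.append(
            f"size differs: main={len(main_vocab)} draft={len(draft_vocab)}"
        )

    # Compare token strings id by id over the shared range.
    for token_id, (main_tok, draft_tok) in enumerate(zip(main_vocab, draft_vocab)):
        if main_tok != draft_tok:
            problems.append(
                f"id {token_id}: main={main_tok!r} draft={draft_tok!r}"
            )
            if len(problems) >= limit:
                break  # keep the report short

    return problems


if __name__ == "__main__":
    # Toy vocabularies for illustration only.
    main = ["<s>", "</s>", "hello", "world"]
    draft = ["<s>", "<eos>", "hello", "world"]
    for line in vocab_mismatches(main, draft):
        print(line)
```

Two quantizations of the same model should produce an empty result. Any reported mismatch, especially among the special tokens, suggests the pair is not a safe draft/main combination.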
Journey Context:
Standard speculative decoding tutorials suggest TinyLlama or Pythia-160M as draft models. On Apple Silicon with unified memory, loading two different architectures causes memory fragmentation and tokenizer alignment issues (different special tokens), which negate the speed gains. Using a Q2_K quant of the same 70B model as the draft gives a high token acceptance rate (high alpha, though not a guaranteed 100%, since a quantized draft can still diverge from the main model), a shared tokenizer, and efficient memory use (Q2_K + Q4_K_M < 64GB on a 128GB Mac, as reported; verify against your actual file sizes). The draft model runs on the GPU alongside the main model. Common error: using --draft with a model that has a different BPE vocabulary, which causes cryptic token generation errors.
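To see how the acceptance rate (alpha) trades off against draft cost, here is a rough sketch of the estimate commonly used in the speculative decoding literature. It assumes each drafted token is accepted independently with probability alpha, which is a simplification. The function names and the example numbers are illustrative assumptions, not measurements from this report.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per main-model pass.

    alpha: per-token acceptance probability (assumed independent).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the main model.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio: time of one draft-model pass divided by the time of one
    main-model pass (measure this on your own hardware).
    """
    if cost_ratio < 0:
        raise ValueError("cost_ratio must be non-negative")
    # One step costs gamma draft passes plus one main-model pass.
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Hypothetical inputs: replace with values measured on your machine.
    for alpha in (0.6, 0.8, 0.95):
        print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.3), 2))
```

A result below 1.0 means the draft model costs more than it saves. The point of measuring `cost_ratio` yourself is that a high alpha helps only if the draft pass is substantially cheaper than the main pass.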
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:13:56.342097+00:00— report_created — created