Report #43763
[tooling] Speculative decoding slower than base model or token stream corruption
Draft model MUST use the exact same tokenizer (vocabulary and merges) as the target; use n_draft 16-32; ensure draft is CPU-fast while target is GPU-bound
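A minimal sketch of the tokenizer check, assuming you can already export each model's vocabulary as a token-to-id mapping and its BPE merges as an ordered list of pairs. The function name `tokenizer_mismatches` and its parameters are hypothetical and not part of any runtime's API; how the vocab and merges are loaded depends on your model format and is left out.

```python
from typing import Dict, List, Sequence, Tuple


def tokenizer_mismatches(
    draft_vocab: Dict[str, int],
    target_vocab: Dict[str, int],
    draft_merges: Sequence[Tuple[str, str]] = (),
    target_merges: Sequence[Tuple[str, str]] = (),
    max_examples: int = 10,
) -> List[str]:
    """Return human-readable problems; an empty list means the tokenizers match."""
    problems: List[str] = []

    # Different sizes already rule out a shared token-id space.
    if len(draft_vocab) != len(target_vocab):
        problems.append(
            f"vocab size differs: draft={len(draft_vocab)} target={len(target_vocab)}"
        )

    # Every target token must map to the same id in the draft vocabulary.
    examples = 0
    for token, target_id in target_vocab.items():
        draft_id = draft_vocab.get(token)
        if draft_id != target_id:
            if examples < max_examples:
                problems.append(
                    f"token {token!r}: draft id {draft_id}, target id {target_id}"
                )
            examples += 1
    if examples > max_examples:
        problems.append(f"... and {examples - max_examples} more token mismatches")

    # Merge order matters for BPE, so compare the lists element by element.
    if list(draft_merges) != list(target_merges):
        problems.append("BPE merge rules differ (content or order)")

    return problems


if __name__ == "__main__":
    # Toy vocabularies standing in for real exports (assumption, not real data).
    target = {"a": 0, "b": 1, "ab": 2}
    draft = {"a": 0, "b": 2, "ab": 1}
    for line in tokenizer_mismatches(draft, target):
        print(line)
```

Any non-empty result means draft token ids would be interpreted differently by the target, which matches the corruption symptom in this report.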
Journey Context:
Common failures include using a draft model with a different tokenizer (which corrupts the token stream) or running the draft on the GPU (which makes it contend with the target). The draft must be small enough to run on CPU without starving the main model's GPU kernels. Also, n_draft above 32 rarely helps: a drafted token is only kept if every token before it was accepted, so the chance of reaching later positions decays quickly.
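A rough way to see the n_draft ceiling, assuming every drafted token is accepted independently with the same probability (a simplification; real acceptance varies by position and prompt). Under that assumption the expected number of tokens produced per target pass is (1 - a^(n+1)) / (1 - a). The function name and the 0.8 acceptance rate below are illustrative assumptions, not measurements from this report.

```python
def expected_tokens_per_step(acceptance_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model pass under a constant,
    independent per-token acceptance rate (simplifying assumption)."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus one token from the target itself.
        return float(n_draft + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n_draft
    return (1.0 - acceptance_rate ** (n_draft + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    rate = 0.8  # illustrative value only
    for n in (4, 8, 16, 32, 64):
        print(f"n_draft={n:>2}  expected tokens/step={expected_tokens_per_step(rate, n):.3f}")
```

With a fixed acceptance rate a, the result can never exceed 1 / (1 - a), so raising n_draft past the point where the curve flattens only adds draft-model work per step.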
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:55:50.220299+00:00— report_created — created