Report #62652
[tooling] Local LLM inference throughput is 10x slower than API alternatives despite high-end GPU
Enable speculative decoding with --draft 16 --draft-model-mini (or path to 1B-7B draft GGUF) where the draft model shares the same tokenizer vocabulary as the target model
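Before loading both models, it can help to confirm that the two vocabularies really line up. The sketch below is a minimal illustration, not part of any inference tool: `vocab_mismatches` is a hypothetical helper, and it assumes you have already extracted each model's token list (index = token ID) by whatever means your tooling provides. It does not read GGUF files itself.

```python
from typing import List, Optional, Tuple

Mismatch = Tuple[int, Optional[str], Optional[str]]


def vocab_mismatches(
    target_vocab: List[str],
    draft_vocab: List[str],
    limit: int = 5,
) -> List[Mismatch]:
    """Return up to `limit` (token_id, target_token, draft_token) triples
    where the two vocabularies disagree. None marks an ID that is missing
    from the shorter vocabulary."""
    mismatches: List[Mismatch] = []
    for token_id in range(max(len(target_vocab), len(draft_vocab))):
        t = target_vocab[token_id] if token_id < len(target_vocab) else None
        d = draft_vocab[token_id] if token_id < len(draft_vocab) else None
        if t != d:
            mismatches.append((token_id, t, d))
            if len(mismatches) >= limit:
                break
    return mismatches


if __name__ == "__main__":
    # Toy vocabularies standing in for the real token lists (assumption).
    target = ["<s>", "</s>", "hello", "world"]
    draft = ["<s>", "</s>", "hello", "word"]
    print(vocab_mismatches(target, draft))
```

An empty result only shows that the token strings match ID for ID; merges and special-token handling would still need to be compared separately.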
Journey Context:
Autoregressive generation produces one token per forward pass, and at small batch sizes each pass is memory-bandwidth bound on modern GPUs, since the full set of weights is read to emit a single token. Speculative decoding uses a small, fast draft model (e.g., 1B parameters) to generate K candidate tokens, then the large target model (70B) verifies all K tokens in a single forward pass via parallel scoring. If the draft is accurate (typically 60-80% acceptance rate), each target pass emits the accepted prefix of the draft plus one token from the target itself, so throughput rises with the acceptance rate instead of staying at one token per pass; the gain is reduced by the time spent running the draft model. Critical constraint: draft and target must share identical tokenizers (vocab and merges); otherwise, token IDs misalign. The --draft flag sets the number of tokens drafted per verification step; 16-32 is reported to work well for 70B models. This requires sufficient VRAM to hold both models simultaneously.
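To get a rough feel for how the acceptance rate and K interact, the sketch below estimates the expected number of tokens emitted per target-model pass. It rests on a simplifying assumption that is not from this report: every drafted token is accepted independently with the same probability `alpha`. Under that assumption the expectation is (1 - alpha^(K+1)) / (1 - alpha). The function name is hypothetical, and the estimate ignores the cost of running the draft model, so treat it as an upper bound on the speedup, not a prediction.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass when drafting k tokens.

    Assumes each drafted token is accepted independently with probability
    alpha (a simplification). The accepted prefix is followed by one token
    from the target model, so the result lies between 1 and k + 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the target's own token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.6, 0.7, 0.8):
        for k in (4, 8, 16, 32):
            est = expected_tokens_per_pass(alpha, k)
            print(f"alpha={alpha:.1f} K={k:2d} tokens/pass ~ {est:.2f}")
```

Under this model the value can never exceed 1 / (1 - alpha), which is 5 tokens per pass at an 80% acceptance rate, so a larger K brings diminishing returns and is worth measuring on your own workload.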
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:38:39.184286+00:00— report_created — created