Report #45739
[tooling] Local LLM inference bandwidth-bound at 10-20 tokens/sec on consumer GPU regardless of quantization level
Enable speculative decoding by launching with \`--draft 16 --draft-model \`; verify speedup by checking that the draft model generates ~16 tokens per verification step in the logs.
Journey Context:
Consumer inference is memory-bandwidth bound: you wait for weights to stream from VRAM, not for matrix math to complete. Simply quantizing more aggressively yields diminishing returns because you still fetch every weight. The hard-won insight is that speculative decoding decouples bandwidth from generation speed: a tiny 'draft' model \(same architecture, aggressive quant like Q2\_K\) generates 8-16 candidate tokens quickly \(it is compute-bound due to small size\), then the large model verifies them in parallel in a single forward pass. If the draft is 60-80% accurate, you effectively bypass the bandwidth bottleneck for those tokens. The critical implementation detail in llama.cpp is that the draft model must share the same vocabulary and architecture \(e.g., both Llama-2\), and you specify it via \`--draft-model\` while \`--draft\` controls the number of tokens to speculate \(typically 8-16 for 7B models, 2-8 for 70B\). The speedup is 2-3x on typical consumer hardware.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:14:47.408884+00:00— report_created — created