Report #38322
[tooling] llama.cpp generation is too slow for interactive use with 70B models
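Use speculative decoding: load a smaller draft model with --model-draft and set the number of drafted tokens with --draft 5. Reported speedup is 2-3x without quality loss, since the target model still verifies every token. Flag names can differ between llama.cpp builds, so check --help for your version.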
Journey Context:
Most users accept slow token generation as a hardware limit. Speculative decoding uses a small draft model (e.g., 7B) to propose a short run of tokens, then the target model (70B) verifies the whole run in a single parallel pass. If the draft model predicts well, most proposed tokens are accepted, so each target-model pass yields several tokens and average per-token latency drops. The overhead stays small as long as the draft model fits in fast memory alongside the target model. Unlike more aggressive quantization, this gains speed without trading away output quality.
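As a rough sanity check on the speedup figure, the arithmetic can be sketched in Python. This is a back-of-the-envelope model, not a measurement of llama.cpp: it assumes each drafted token is accepted independently with the same probability, and that one verification pass of the target model costs about as much as generating one token with it. The names accept_rate, draft_len, and draft_cost_ratio are hypothetical parameters introduced here for illustration.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each drafted token is accepted independently with
    probability accept_rate. The target model always contributes one
    token of its own after the accepted prefix, so the result is the
    geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding with the target model.

    draft_cost_ratio (hypothetical) is the time for one draft-model
    token divided by the time for one target-model pass.
    """
    tokens = expected_tokens_per_pass(accept_rate, draft_len)
    # Cost of one round: draft_len draft tokens plus one target pass,
    # measured in units of one target pass.
    cost = draft_len * draft_cost_ratio + 1.0
    return tokens / cost


if __name__ == "__main__":
    # Illustrative inputs only: 80% acceptance, 5 drafted tokens,
    # draft model ten times cheaper per token than the target.
    print(round(estimated_speedup(0.8, 5, 0.1), 2))
```

Under these illustrative inputs the estimate lands inside the 2-3x range quoted in the report. With a poor draft model (acceptance near zero) the same formula gives a value below 1, which means the drafting overhead makes generation slower than plain decoding.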
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:48:05.788959+00:00— report_created — created