Report #56732
[tooling] 70B model inference too slow for interactive chat even with full GPU offloading
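Enable speculative decoding in the llama.cpp server by loading a small draft model next to the 70B target. Pass the draft model with --model-draft (-md) /path/to/tiny/draft-model.gguf (e.g., TinyLlama-1.1B or a 160M model) and set the number of tokens drafted per step with --draft-max (aliases --draft and --draft-n). Note that --draft takes a token count, not a model path. The draft and target models must share a compatible vocabulary. This report additionally used --draft-n-parallel 4 (described as the tree depth) to request tree-based drafting; that flag name is not confirmed for the server, so check llama-server --help on your build before relying on it. The reported result is a 2-3x speedup, obtained by drafting a tree of candidate tokens and verifying the whole tree in a single forward pass of the main model.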
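As a rough way to sanity-check the 2-3x figure, the sketch below estimates how many tokens one target-model forward pass yields for a plain linear draft. It assumes each drafted token is accepted independently with the same probability, which is a simplification; `alpha` (acceptance rate) and `k` (draft length) are illustrative parameters, not llama.cpp settings, and draft-model cost is ignored.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass with a linear draft.

    Assumption: each of the k drafted tokens is accepted independently
    with probability alpha, and the target always contributes one token
    (the correction or the bonus token after a fully accepted draft).
    """
    if alpha >= 1.0:
        return float(k + 1)  # every drafted token accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=8), 2))
```

Under these assumptions the upper bound on speedup is the value returned, since the time spent running the draft model reduces the real gain.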
Journey Context:
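Standard speculative decoding drafts N tokens sequentially, so one rejected token discards the rest of the draft. Tree-based speculation instead drafts a tree of possibilities and verifies the entire tree against the target model in one pass, using an attention mask in which each drafted token sees only its own ancestors. If one branch is rejected, another can still be accepted, so more tokens are accepted per target forward pass. In this report the key flag was --draft-n-parallel (not just --draft-n), which is what switched on tree mode. Crucially, the draft model can be more than 100x smaller (160M vs 70B) because local token prediction has high agreement even across model scales. Without tree mode the gain is limited to the accepted prefix of a single linear draft; with it, the gain is larger. Expect the biggest benefit in the single-user interactive case, where decoding is limited by memory bandwidth rather than compute.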
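The attention mask used for tree verification can be illustrated with a minimal sketch. This is not llama.cpp's implementation; the `parents` list encoding and the function name are hypothetical, chosen only to show the rule that each drafted token attends to itself and its ancestors (the shared prompt context is omitted).

```python
def tree_attention_mask(parents: list[int]) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True if draft node i may attend to node j.

    parents[i] is the index of node i's parent in the draft tree,
    or -1 if node i directly follows the already accepted context.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        # Walk from the node up to the root, marking every ancestor as visible.
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# Hypothetical draft tree: node 0 is the first drafted token,
# nodes 1 and 2 are alternative continuations, 3 follows 1, 4 follows 2.
parents = [-1, 0, 0, 1, 2]
mask = tree_attention_mask(parents)

if __name__ == "__main__":
    for row in mask:
        print("".join("x" if visible else "." for visible in row))
```

Because sibling branches never see each other, all branches can be scored in the same forward pass as if each had been evaluated on its own.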
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:42:53.942062+00:00— report_created — created