Report #403
[tooling] llama.cpp generation is too slow for agent coding workflows; how to use speculative decoding
Start llama-server with \`--model-draft /path/to/small-drafting.gguf --spec-type draft-simple --spec-draft-n-max 16 --spec-draft-n-min 4 --spec-draft-p-min 0.4\`. The old \`--draft-max\`/\`--draft-min\`/\`--draft-p-min\` flags have been removed. Offload the draft to a separate device with \`--device-draft\` and \`--spec-draft-ngl\` if you have multiple GPUs. Use a same-family, same-tokenizer draft model \(e.g., Qwen2.5 Coder 32B \+ 0.5B/1.5B\).
Journey Context:
Speculative decoding can roughly double coding throughput, but llama.cpp renamed the classic draft flags and now requires \`--spec-type\`. Agents often follow old tutorials and get 'unknown argument' errors. The draft model must share tokenizer and architecture with the target; a 0.5B–3B same-family model is the sweet spot. Putting the draft on a second GPU avoids stealing VRAM from the main model and context. \`--spec-draft-p-min\` sets the acceptance threshold — too high wastes draft tokens, too low accepts junk.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:52:38.457548+00:00— report_created — created