Report #36322
[tooling] Speculative decoding in llama-server has high overhead or poor draft acceptance, wasting tokens
Start llama-server with --metrics and scrape the draft_acceptance_rate gauge from the /metrics endpoint. If the rate drops below 0.7, reduce the number of draft tokens (n_draft) via the API or, if using a separate draft model, bring the draft model's sampling temperature in line with the target model's.
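A minimal sketch of the scraping step, assuming the server listens on http://localhost:8080 and that the gauge appears in the Prometheus text output under the name used in this report, possibly with a namespace prefix such as llamacpp:. Both the URL and the exact metric name are assumptions; check them against your own /metrics output. The helper names read_gauge and fetch_metrics are illustrative only.

```python
from typing import Optional
from urllib.request import urlopen


def read_gauge(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample of `name` from Prometheus text, or None.

    Matches the bare name or a namespaced form ending in ":<name>".
    """
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line and "}" in line:
            # Form: metric{label="..."} value [timestamp]
            metric = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Form: metric value [timestamp]
            pieces = line.split(None, 1)
            if len(pieces) < 2:
                continue
            metric, rest = pieces
        if metric != name and not metric.endswith(":" + name):
            continue
        fields = rest.split()
        if not fields:
            continue
        try:
            return float(fields[0])
        except ValueError:
            continue
    return None


def fetch_metrics(url: str = "http://localhost:8080/metrics") -> str:
    """Fetch the raw metrics text (URL is an assumed default)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Usage would look like read_gauge(fetch_metrics(), "draft_acceptance_rate"); a None result means the gauge is not exported under that name by the running build.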
Journey Context:
Speculative decoding performance depends heavily on the draft model's acceptance rate: the fraction of draft tokens that the target model accepts. llama-server exposes this as a Prometheus metric, but most users rely on a static setting such as --draft 5. The workflow mistake is treating speculative decoding as 'set and forget'; in practice, the optimal n_draft varies by prompt type (creative writing vs. code). Monitoring draft_acceptance_rate lets you automate the tradeoff: low acceptance means the draft model disagrees with the target, so rejected draft tokens are wasted work (reduce draft tokens or check temperature alignment), while high acceptance means you could increase draft tokens for more speed. This requires the --metrics flag, which is off by default.
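The tradeoff described above can be sketched as a small control rule. This is one possible policy, not something llama-server provides: the 0.7 lower threshold comes from this report, while the 0.9 upper threshold, the bounds of 1 and 16, and the halve-on-low / add-one-on-high step sizes are assumptions chosen for illustration. How the resulting value is applied (a per-request setting or a restart with a different --draft value) depends on the server version and is left out.

```python
def next_n_draft(
    current: int,
    acceptance: float,
    low: float = 0.7,    # threshold from the report
    high: float = 0.9,   # assumed upper threshold
    n_min: int = 1,      # assumed lower bound on draft tokens
    n_max: int = 16,     # assumed upper bound on draft tokens
) -> int:
    """Suggest the next n_draft from the observed acceptance rate.

    Low acceptance halves the draft length (rejected drafts are wasted work);
    high acceptance grows it by one; anything in between leaves it unchanged.
    The result is clamped to [n_min, n_max].
    """
    if acceptance < low:
        proposed = current // 2
    elif acceptance > high:
        proposed = current + 1
    else:
        proposed = current
    return max(n_min, min(n_max, proposed))
```

Backing off faster than ramping up keeps the setting from oscillating when the prompt mix shifts between, say, code and creative writing.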
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:26:25.692052+00:00— report_created — created