Report #52716
[tooling] llama.cpp inference is slow on repetitive code without a draft model for speculative decoding
Use n-gram lookup decoding: add --lookup-ngram-min 2 --lookup-num-keep 48 to the main or server command line. Lookup decoding searches the current context for matching n-grams and uses the tokens that followed them as predictions for upcoming tokens, so no draft model has to be loaded.
Journey Context:
Standard speculative decoding requires a second (draft) model, which increases memory usage. Lookup decoding exploits repetitive patterns (common in JSON/XML/code) by matching n-grams already present in the prompt or cache to propose candidate tokens, which the main model then verifies. It reportedly achieves a 1.5-2x speedup with no extra VRAM for a draft model. Most users only know about --draft and miss this low-cost alternative for repetitive contexts.
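The n-gram matching described above can be illustrated with a minimal Python sketch. This is not llama.cpp's implementation: the function names propose_draft and count_accepted, and the parameters ngram_min, ngram_max and num_draft, are hypothetical and chosen only for illustration. It assumes tokens are plain integers and that verification is greedy (a draft token is kept only if it equals the token the main model would have picked).

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_min: int = 2,
                  ngram_max: int = 4, num_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the context's trailing n-gram earlier in the context.

    Hypothetical helper for illustration; parameter names are assumptions.
    """
    # Try longer n-grams first: a longer match is a stronger hint.
    for n in range(min(ngram_max, len(tokens) - 1), ngram_min - 1, -1):
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                follow = start + n
                # The tokens that followed the match become the draft.
                return list(tokens[follow:follow + num_draft])
    return []  # No repeated n-gram found: fall back to normal decoding.


def count_accepted(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading draft tokens that agree with the main model's choices.

    `verified` stands in for the tokens the main model picks at each draft
    position when it evaluates the whole draft in one batch (an assumption).
    """
    accepted = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2]  # ends with the repeated bigram (1, 2)
    draft = propose_draft(context, num_draft=3)
    print(draft)  # [3, 4, 5]
    print(count_accepted(draft, [3, 4, 9]))  # 2
```

The sketch shows why the technique only pays off on repetitive text: when the trailing n-gram never occurred earlier, the draft is empty and generation proceeds one token at a time as usual.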
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:58:47.046707+00:00 - report_created - created