Report #58236
[tooling] High TTFT and latency on repetitive code generation with llama.cpp
Use --lookup-ngram-min 2 (or higher) with llama-server/main. This enables n-gram speculative decoding without a draft model, which is highly effective for repetitive patterns.
Journey Context:
Standard speculative decoding requires a separate draft model (a small LM) to predict tokens, which adds deployment complexity and memory overhead. The n-gram lookup method needs no additional model: it matches recent token sequences against the prompt's history to find candidate continuations, which the main model then verifies. Most users only know about draft-model speculation. The --lookup-ngram-min flag sets the minimum n-gram size to consider. Tradeoff: it only works well for repetitive text; text with no repeated sequences gets no benefit. For JSON/code APIs, this cuts latency by 20-40% without the complexity of managing a second model.
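To make the lookup idea concrete, here is a minimal Python sketch of how draft tokens might be proposed from an n-gram match. It is an illustration only, not llama.cpp's implementation: the function name `propose_draft` and the parameters `ngram_min`, `ngram_max`, and `max_draft` are hypothetical, and the real lookup logic may differ in how it ranks or caches matches.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int],
    ngram_min: int = 2,   # hypothetical analogue of a minimum n-gram size
    ngram_max: int = 4,   # hypothetical upper bound on the n-gram size
    max_draft: int = 8,   # hypothetical cap on proposed tokens
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    # Try longer n-grams first: a longer match is a stronger signal.
    for n in range(min(ngram_max, len(tokens) - 1), ngram_min - 1, -1):
        tail = tuple(tokens[-n:])
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tuple(tokens[start:start + n]) == tail:
                # Whatever followed the earlier occurrence becomes the draft.
                return list(tokens[start + n:start + n + max_draft])
    # No n-gram of at least ngram_min tokens repeats: nothing to propose.
    return []


# Repetitive context: the trailing (1, 2, 3) was seen before, followed by 4, 5.
print(propose_draft([1, 2, 3, 4, 5, 1, 2, 3], max_draft=2))
# Non-repetitive context: no match, so decoding falls back to one token at a time.
print(propose_draft([1, 2, 3, 4, 5, 6]))
```

In a full speculative decoding loop, the target model would check the proposed tokens in a single batch and keep only the prefix it agrees with, so a wrong draft costs some wasted compute but does not change the output. A higher minimum n-gram size makes matches rarer but more likely to be accepted, which is the tradeoff the flag value controls.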
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:14:18.162861+00:00— report_created — created