Report #11239
[tooling] Slow generation for structured/repetitive outputs (JSON, code) in llama.cpp without a draft model
Enable n-gram lookup decoding by adding `--lookup 2` (or 3-4) to the command. This caches n-grams from the prompt/context and matches them during generation, providing speculative speedups for repetitive text with zero extra VRAM or model loading overhead.
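To make the matching step concrete, here is a minimal Python sketch of the general n-gram lookup idea. It is an illustration only, not llama.cpp's implementation: the function names `build_ngram_index` and `propose_draft` are hypothetical, and tokens are plain strings rather than model token IDs.

```python
from collections import defaultdict


def build_ngram_index(tokens, n):
    """Map each n-gram in `tokens` to the positions that immediately follow it."""
    index = defaultdict(list)
    # Stop early so every indexed n-gram has at least one following token.
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])].append(i + n)
    return index


def propose_draft(tokens, n, max_draft):
    """Propose up to `max_draft` tokens by copying what followed the most
    recent earlier occurrence of the last n tokens. Returns [] on no match."""
    if len(tokens) < n:
        return []
    index = build_ngram_index(tokens, n)
    positions = index.get(tuple(tokens[-n:]))
    if not positions:
        return []
    start = positions[-1]  # prefer the most recent match
    return tokens[start:start + max_draft]


# Hypothetical, hand-tokenized JSON-like context.
context = ["{", '"id"', ":", "1", "}", ",", "{", '"id"', ":"]
print(propose_draft(context, n=2, max_draft=3))  # ['1', '}', ',']
```

The proposed tokens are only guesses. In speculative decoding the model still verifies them and keeps only the prefix it agrees with, so a bad guess costs time but does not change the output.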
Journey Context:
Most users immediately reach for `--draft` with a separate draft model, which increases VRAM usage and complicates deployment. Lookup decoding (added Nov 2024) instead treats the prompt/context itself as the speculation source. It shines when generating structured formats where phrases repeat (e.g., `},` followed by `"key":`). Unlike a draft model, it requires no second model file, which also keeps it practical on CPU-only machines. Tradeoff: it can only match exact n-grams already present in the context, so it is less effective for creative prose, where drafts rarely match.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
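The tradeoff noted in the journey context (good on repetitive structure, weak on creative prose) can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a benchmark of llama.cpp: `draft_from_context` and `drafted_fraction` are hypothetical helpers, and `target` stands in for the tokens the model would have generated anyway.

```python
def draft_from_context(context, n, max_draft):
    """Copy the tokens that followed the latest earlier match of the last n tokens."""
    tail = context[-n:]
    # Scan backwards, skipping the tail itself, so the newest match wins.
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == tail:
            return context[i + n:i + n + max_draft]
    return []


def drafted_fraction(prompt, target, n=2, max_draft=4):
    """Fraction of `target` tokens that came from accepted drafts."""
    context, pos, accepted_total = list(prompt), 0, 0
    while pos < len(target):
        draft = draft_from_context(context, n, max_draft)
        accepted = 0
        # Accept draft tokens only up to the first disagreement with the target.
        for guess, truth in zip(draft, target[pos:]):
            if guess != truth:
                break
            accepted += 1
        accepted_total += accepted
        # After the accepted prefix, the model contributes one token of its own.
        step = target[pos:pos + accepted + 1]
        context.extend(step)
        pos += len(step)
    return accepted_total / len(target)


# Hypothetical inputs: a repeated JSON-like record versus non-repeating prose.
record = ["{", '"id"', ":", "0", ",", '"ok"', ":", "true", "}", ","]
structured = drafted_fraction(record, record * 3)
prose = drafted_fraction("the quick brown fox".split(),
                         "jumps over a lazy dog near some river".split())
print(f"structured: {structured:.2f}, prose: {prose:.2f}")
```

In this toy setup most of the structured output is covered by accepted drafts, while the prose case never finds a matching n-gram and falls back to ordinary one-token-at-a-time generation.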
Lifecycle
2026-06-16T12:50:16.549849+00:00— report_created — created