Report #1155
[tooling] Speculative decoding is too heavy because I don't have a compatible small draft model
Use llama-server's draftless speculative decoding: `--spec-type ngram-simple` for repeated code/text patterns, or `--spec-type ngram-mod` for a ~16 MB hash pool shared across slots. No extra model download, no tokenizer matching, and no VRAM spent on a draft model.
Journey Context:
The classic speculative-decoding setup needs a draft model with a matching vocabulary, which costs VRAM and complicates deployment. llama.cpp also supports self-speculation via n-gram matching: it looks for the most recent tokens earlier in the current context and, when it finds a match, drafts the tokens that followed it as the continuation. The main model then verifies the draft, so a wrong guess costs some wasted work but does not change the output. This works best on repetitive output, such as refactoring a file, summarizing text with repeated phrases, or running reasoning models that echo their own chain-of-thought. It does little for free-form creative text, where earlier n-grams rarely recur. `ngram-mod` is especially cheap and shares its statistics across server slots.
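To make the n-gram idea concrete, here is a minimal Python sketch of draftless drafting. It is a toy illustration only, not llama.cpp's actual implementation: the function names `draft_from_ngrams` and `count_accepted`, and the parameters `ngram_size` and `max_draft`, are hypothetical and do not correspond to any llama-server option.

```python
from typing import List, Sequence


def draft_from_ngrams(
    tokens: Sequence[int],
    ngram_size: int = 3,   # hypothetical parameter: length of the lookup key
    max_draft: int = 8,    # hypothetical parameter: cap on drafted tokens
) -> List[int]:
    """Draft a continuation by reusing what followed the trailing n-gram earlier.

    Toy sketch: takes the last `ngram_size` tokens, finds their most recent
    earlier occurrence in the context, and proposes the tokens after it.
    """
    if len(tokens) <= ngram_size:
        return []
    key = tuple(tokens[-ngram_size:])
    # Scan backwards from just before the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tuple(tokens[start:start + ngram_size]) == key:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []  # no earlier match, so nothing to draft


def count_accepted(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count how many leading draft tokens agree with the main model's tokens."""
    accepted = 0
    for drafted, actual in zip(draft, verified):
        if drafted != actual:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]       # the sequence 1, 2, 3 repeats
    draft = draft_from_ngrams(context, ngram_size=3, max_draft=2)
    print(draft)                              # drafted continuation
    print(count_accepted(draft, [4, 5, 6]))   # tokens the main model confirms
```

The sketch shows why repetitive output benefits: a draft only exists when the trailing n-gram has appeared before, and it only pays off when the main model agrees with the reused continuation.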
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:54:09.510008+00:00— report_created — created