Report #48667
[tooling] High latency on repetitive text generation even with flash attention enabled
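Enable `--lookup-ngram-min 2 --lookup-ngram-max 4` to activate lookup decoding, which uses the existing context as a draft source: when the most recent tokens match an earlier n-gram, the tokens that followed it are proposed as a draft and verified together in a single batched forward pass, rather than one forward pass per token.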
Journey Context:
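Standard speculative decoding requires loading a separate draft model, which consumes VRAM. Lookup decoding (prompt lookup) instead builds a dynamic n-gram cache from the current context window: it matches the trailing tokens against earlier sequences and copies the tokens that followed as draft candidates. The target model still verifies the draft, so output is unchanged; the gain comes from accepting several tokens per forward pass. The reported result is a 2-3x speedup on repetitive structured outputs (JSON, code, boilerplate text) with no extra VRAM for a draft model and no draft-model distribution mismatch. Tradeoff: higher CPU overhead for hash table maintenance, and little or no benefit when the output does not repeat earlier context.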
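The matching step described above can be sketched roughly as follows. This is a minimal illustration of the general prompt lookup idea, not the tool's actual implementation: the function names `propose_draft` and `accept_draft`, the `max_draft` limit, and the linear backward scan (a real implementation would likely use a hash table) are all assumptions made for the example. Verification is simulated with one callback call per token, whereas a real runtime would score the whole draft in one batched forward pass.

```python
from typing import Callable, List, Sequence


def propose_draft(tokens: Sequence[int], ngram_min: int = 2,
                  ngram_max: int = 4, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context.

    Hypothetical helper: tries the longest n-gram first and returns the tokens
    that followed the most recent earlier occurrence, or [] if nothing matches.
    """
    n_tokens = len(tokens)
    for n in range(min(ngram_max, n_tokens - 1), ngram_min - 1, -1):
        suffix = list(tokens[n_tokens - n:])
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + max_draft])
    return []


def accept_draft(tokens: Sequence[int], draft: Sequence[int],
                 next_token_fn: Callable[[List[int]], int]) -> List[int]:
    """Greedy verification: keep draft tokens while the model agrees.

    `next_token_fn` stands in for the target model (an assumption for this
    sketch). On the first mismatch the model's own token is emitted instead,
    so the result always equals what plain greedy decoding would produce.
    """
    ctx = list(tokens)
    accepted: List[int] = []
    for candidate in draft:
        expected = next_token_fn(ctx)
        if expected != candidate:
            accepted.append(expected)
            break
        accepted.append(candidate)
        ctx.append(candidate)
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    draft = propose_draft(context)
    # Toy "model" that cycles through 1..5, so the whole draft is accepted.
    print(draft, accept_draft(context, draft, lambda ctx: ctx[-1] % 5 + 1))
```

The longest-match-first order reflects the intuition that a longer matching n-gram is a stronger predictor of what comes next; the minimum length guards against drafting from weak, coincidental matches.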
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:10:13.396076+00:00— report_created — created