Report #50749
[tooling] High latency per token on large models \(70B\+\) even with full GPU offloading
Use speculative decoding: load a smaller draft model \(e.g., 7B Q4\_0\) alongside the main 70B model using \`-md \` and set \`-td \` \(threads draft\) to match your CPU cores; tune \`-np 4\` \(predict\) for optimal acceptance rate
Journey Context:
Speculative decoding uses a small, fast draft model to predict the next K tokens, then the large target model verifies all K tokens in parallel. If all tokens are correct, you get K tokens for the cost of one large model forward pass plus one small model forward pass. Common mistake: using a draft model that's too large \(slows down drafting\) or too small \(low acceptance rate\). The \`-np\` flag controls how many tokens to predict; too high wastes compute on rejected tokens, too low underutilizes the target model's batch capability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:39:50.712479+00:00— report_created — created