Report #92490
[tooling] llama.cpp inference throughput too slow for production use
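Use speculative decoding with a small draft model hosted on CPU while the main model runs on GPU. Example as reported: `./main -m 70b.gguf -md 7b-q4_0.gguf --draft 16 --draft-devices CPU -ngl 999`. Here `-md` selects the draft model, `--draft 16` sets the number of tokens drafted per round, and `-ngl 999` offloads all layers of the main model to the GPU. Keeping the draft on CPU means it does not compete with the main model for GPU VRAM. Binary and flag names differ between llama.cpp versions: recent builds ship speculative decoding in `llama-speculative` and `llama-server`, and control draft offload with `-ngld` (`--gpu-layers-draft`), where `-ngld 0` keeps the draft on CPU. Confirm every flag against `--help` for your build before running.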
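To put the VRAM contention in numbers, here is a rough estimate of the weight memory a 7B Q4_0 draft would claim if it were placed on the GPU. It relies on the Q4_0 block layout (32 weights stored in 18 bytes, about 4.5 bits per weight) and assumes every tensor is Q4_0. Real GGUF files keep some tensors at higher precision, and the KV cache and compute buffers come on top, so treat the result as a lower bound. The function names are illustrative only.

```python
# Rough lower bound on the weight memory of a Q4_0 model.
# Q4_0 packs 32 weights per block: 16 bytes of 4-bit values plus a 2-byte fp16 scale.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 18


def q4_0_weight_bytes(n_params: int) -> int:
    """Approximate bytes for n_params weights, assuming every tensor is Q4_0."""
    n_blocks = -(-n_params // Q4_0_BLOCK_WEIGHTS)  # ceiling division
    return n_blocks * Q4_0_BLOCK_BYTES


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# Assumption: "7B" is taken as exactly 7e9 parameters for illustration.
draft_bytes = q4_0_weight_bytes(7_000_000_000)
print(f"7B Q4_0 weights: ~{to_gib(draft_bytes):.2f} GiB")
```

Whatever the exact figure for a given model file, that memory stays available to the 70B model and its KV cache when the draft runs on CPU.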
Journey Context:
Standard speculative decoding examples place both the draft and the target model on the GPU, but the two then compete for VRAM, which can force part of the main model off the GPU and slow it down. The insight is to run a small Q4_0 7B draft on CPU cores (fast enough to propose draft tokens) so that the GPU is devoted entirely to the 70B main model, which verifies the drafted tokens. The reporter saw 2x-3x throughput with no extra GPU VRAM; the actual gain depends on how often the main model accepts the draft's tokens and on CPU speed. An alternative, n-gram speculation (`--lookup-ngram`), requires no draft model but only helps on repetitive text.
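The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy variant for illustration, not llama.cpp's implementation: the two "models" are hypothetical stand-in functions over integer tokens, and verification calls the target once per position, whereas a real engine scores all drafted positions in one batched forward pass (which is where the speedup comes from).

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to the next token (greedy).
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], n_draft: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens accepted in it."""
    # 1. The cheap draft model proposes n_draft tokens autoregressively.
    proposal: List[int] = []
    for _ in range(n_draft):
        proposal.append(draft(context + proposal))

    # 2. The target checks each drafted position in order.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             n_tokens: int, n_draft: int) -> Tuple[List[int], int]:
    """Generate n_tokens after the prompt; also return the number of rounds used."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, n_draft))
        rounds += 1
    return out[: len(prompt) + n_tokens], rounds


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong after token 7.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


tokens, rounds = generate(target_model, draft_model, [0], n_tokens=20, n_draft=4)
print(f"generated {len(tokens) - 1} tokens in {rounds} verification rounds")
```

The output is identical to what the target alone would produce with greedy decoding; the draft only changes how many target rounds are needed. That is why a draft that often disagrees with the main model yields little or no gain.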
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:50:10.144614+00:00— report_created — created