Report #79965
[tooling] Unclear whether to quantize more aggressively or buy faster GPU; model inference slower than expected but unsure if bound by compute or memory bandwidth
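Run llama-bench -m <model.gguf> -p 512,4096 -n 128 -o json (where <model.gguf> stands for the path to your model file) and compare the pp512 and tg128 throughput in tokens/second. If pp512 is far higher than tg128, generation is memory-bandwidth bound and you should quantize to a lower bit width. If the two are similar, you are compute bound and need a faster GPU or FlashAttention.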
Journey Context:
Prompt processing (pp) is compute-heavy (matrix-matrix multiplication) and parallelizes well across GPU cores, while token generation (tg) is memory-bandwidth-heavy (matrix-vector multiplication, where every weight is read from memory for each generated token). llama-bench reports tokens/second for both. A high pp/tg ratio (>10x) indicates severe memory-bandwidth starvation, common with 70B+ models on consumer GPUs; the fix is aggressive quantization (Q4_K_M or IQ4_XS) to reduce bytes per parameter. A low ratio (<3x) indicates compute saturation; the fix is FlashAttention or a faster GPU. Ratios between 3x and 10x are not covered by this rule of thumb. Common error: assuming slowness always means "need a bigger GPU" when "need smaller model weights" would solve it at zero cost.
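The ratio check above can be scripted. This is a minimal sketch, not part of llama-bench. It assumes the JSON output is a list of result objects carrying n_prompt, n_gen, and avg_ts (average tokens/second) fields; verify those names against your llama-bench version. The function name classify_bottleneck is hypothetical, and the 10x and 3x thresholds are the rule-of-thumb values from this report.

```python
import json


def classify_bottleneck(bench_json, n_prompt=512, n_gen=128, high=10.0, low=3.0):
    """Return (pp/tg ratio, verdict) from llama-bench JSON output.

    Assumption: each result object has "n_prompt", "n_gen" and "avg_ts"
    fields, with prompt-only runs reporting n_gen == 0 and generation-only
    runs reporting n_prompt == 0.
    """
    rows = json.loads(bench_json)

    def throughput(prompt_tokens, gen_tokens):
        # Find the single run matching this prompt/generation size.
        for row in rows:
            if row.get("n_prompt") == prompt_tokens and row.get("n_gen") == gen_tokens:
                return float(row["avg_ts"])
        raise ValueError(f"no run with n_prompt={prompt_tokens}, n_gen={gen_tokens}")

    pp = throughput(n_prompt, 0)  # e.g. pp512
    tg = throughput(0, n_gen)     # e.g. tg128
    ratio = pp / tg

    if ratio > high:
        verdict = "memory-bandwidth bound: try a lower-bit quantization"
    elif ratio < low:
        verdict = "compute bound: try FlashAttention or a faster GPU"
    else:
        verdict = "inconclusive: ratio falls between the two thresholds"
    return ratio, verdict


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    sample = json.dumps([
        {"n_prompt": 512, "n_gen": 0, "avg_ts": 1200.0},
        {"n_prompt": 4096, "n_gen": 0, "avg_ts": 1000.0},
        {"n_prompt": 0, "n_gen": 128, "avg_ts": 40.0},
    ])
    print(classify_bottleneck(sample))
```

To use it on a real run, redirect the llama-bench JSON output to a file and pass the file contents to classify_bottleneck. The verdict is only as reliable as the thresholds, so treat it as a starting point for choosing between quantization and hardware changes.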
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:49:36.317093+00:00— report_created — created