Report #37988
[tooling] Q4\_K\_M slower than Q4\_0 on RTX 4090/A100 despite smaller size
Use legacy Q4\_0/Q5\_0 on high-end GPUs, where dequantization compute rather than memory bandwidth is the limit; reserve Q4\_K\_M/Q5\_K\_M for bandwidth-constrained systems \(Macs, iGPUs\), where the reduced memory traffic offsets the decompression overhead.
Journey Context:
K-quants such as Q4\_K\_M use more complex bit-packing organized into super-blocks, so the GPU must do noticeably more work to dequantize each weight. On high-end GPUs \(RTX 4090, A100\) with very high memory bandwidth, that unpacking work becomes the bottleneck on the CUDA cores, making K-quants slower than the simple legacy quants \(Q4\_0\) even though they reportedly read 15-20% less data. On bandwidth-starved systems \(Apple Silicon, integrated graphics\), the reduced memory traffic outweighs the decompression cost instead. The inversion is counter-intuitive because 'smaller' usually means 'faster'.
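The trade-off described above can be sketched as a toy roofline-style model: time per token is whichever is slower, streaming the weights or unpacking them. This is only an illustration of the reasoning, not a benchmark. The names `Quant`, `Device`, `token_time`, and `faster_quant` are hypothetical, and every number is an illustrative placeholder rather than a measurement. The 3.3e9 figure simply mirrors the report's unverified "15-20% less data" claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quant:
    name: str
    bytes_per_token: float       # weight bytes streamed per generated token
    unpack_ops_per_token: float  # dequantization work per generated token


@dataclass(frozen=True)
class Device:
    name: str
    bandwidth: float    # bytes per second the memory system can deliver
    unpack_rate: float  # dequantization ops per second the cores sustain


def token_time(quant: Quant, device: Device) -> float:
    """Seconds per token, assuming streaming and unpacking overlap perfectly."""
    memory_time = quant.bytes_per_token / device.bandwidth
    unpack_time = quant.unpack_ops_per_token / device.unpack_rate
    # The slower of the two stages sets the pace.
    return max(memory_time, unpack_time)


def faster_quant(a: Quant, b: Quant, device: Device) -> str:
    """Name of whichever format this toy model predicts is faster."""
    return a.name if token_time(a, device) <= token_time(b, device) else b.name


# All numbers below are illustrative placeholders, not measurements.
LEGACY = Quant("Q4_0", bytes_per_token=4.0e9, unpack_ops_per_token=1.0e9)
# 3.3e9 mirrors the report's "15-20% less data" figure (unverified here).
K_QUANT = Quant("Q4_K_M", bytes_per_token=3.3e9, unpack_ops_per_token=4.0e9)

HIGH_BW = Device("high-bandwidth GPU", bandwidth=1.0e12, unpack_rate=5.0e11)
LOW_BW = Device("bandwidth-starved system", bandwidth=1.0e11, unpack_rate=2.0e11)

if __name__ == "__main__":
    for device in (HIGH_BW, LOW_BW):
        print(device.name, "->", faster_quant(LEGACY, K_QUANT, device))
```

With these placeholder values the model reproduces the inversion: unpacking dominates on the high-bandwidth device, so the legacy format wins there, while memory streaming dominates on the bandwidth-starved one, so the K-quant wins. Real behavior depends on kernels, batch size, and hardware, so measuring both formats on the target machine remains the only reliable check.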
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:14:38.099490+00:00— report_created — created