Report #35172
[tooling] GGUF quantization quality loss with Q4\_K\_M but Q5\_K\_M too large for VRAM
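Use IQ4\_XS, one of llama.cpp's i-quants, instead of Q4\_K\_M. At about 4.25 bits per weight it is smaller than Q4\_K\_M \(roughly 4.8 to 4.9 bits per weight\), and when quantized with an importance matrix its quality is usually close to Q4\_K\_M, so the VRAM it frees can go to a longer context or more GPU layers. It does not reach Q5\_K\_M quality. If the model still does not fit, IQ2\_XXS \(about 2.06 bits per weight\) gives extreme compression at a clear quality cost.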
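As a rough way to check which quant fits a given card, the weight size can be estimated as parameters × bits per weight / 8. The sketch below is illustrative only: the bits-per-weight table holds approximate values \(k-quant averages vary by model\), and `best_fit` and its `overhead_gb` allowance for KV cache and buffers are hypothetical helpers, not part of llama.cpp.

```python
# Approximate bits per weight (bpw). The IQ values are nominal figures;
# the k-quant values are rough averages and vary from model to model.
APPROX_BPW = {
    "IQ2_XXS": 2.06,
    "IQ4_XS": 4.25,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.70,
}


def weights_gb(n_params: float, quant: str) -> float:
    """Estimated size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9


def best_fit(n_params: float, vram_gb: float, overhead_gb: float = 2.0):
    """Return the highest-bpw quant whose weights fit in VRAM, or None.

    overhead_gb is an assumed allowance for KV cache and compute buffers;
    real usage depends on context length and backend.
    """
    budget = vram_gb - overhead_gb
    fitting = [q for q in APPROX_BPW if weights_gb(n_params, q) <= budget]
    return max(fitting, key=APPROX_BPW.get, default=None)


if __name__ == "__main__":
    for quant in APPROX_BPW:
        print(f"{quant:8s} 14B: {weights_gb(14e9, quant):5.1f} GB   "
              f"70B: {weights_gb(70e9, quant):5.1f} GB")
    print("14B on 10 GB ->", best_fit(14e9, 10))
    print("70B on 24 GB ->", best_fit(70e9, 24))
```

Treat the result as a first filter only; actual file sizes should be read from the GGUF files themselves, and the overhead grows with context length.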
Journey Context:
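Q4\_K\_M is already a mixed-precision scheme: most tensors are stored as Q4\_K, while part of the attention and feed-forward tensors are kept at Q6\_K. The i-quants \(IQ2\_XXS, IQ4\_XS and others\), added to llama.cpp in early 2024, replace the plain linear grid with codebook or non-linear quantization levels, and they are normally produced with an importance matrix \(imatrix\): statistics collected from calibration text, so that rounding error is minimized where it affects the output most. The imatrix is not specific to i-quants; k-quants can be calibrated with one too. IQ4\_XS sits just below Q4\_K\_M in size with similar quality, which makes it useful when Q4\_K\_M barely fits. Common mistake: using IQ quants without confirming that the backend supports them \(llama.cpp does, but some UIs do not\). Tradeoff: IQ quants are somewhat slower to dequantize on CPU; on GPU the difference is usually negligible. Note that a 70B model at IQ4\_XS is still about 37 GB of weights, so it does not fit entirely on a 24 GB card; that requires something close to IQ2\_XXS \(about 18 GB, with substantial quality loss\) or offloading part of the model to the CPU.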
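The idea behind importance weighting can be shown with a toy example: when picking the scale for a block of weights, minimize the error weighted by each weight's importance rather than the plain squared error. This is a simplified sketch, not llama.cpp's actual algorithm; the function names, the 4-bit integer grid, the grid search, and the importance values are all assumptions made for illustration.

```python
def quantize_block(weights, scale, qmin=-8, qmax=7):
    """Round each weight to the nearest 4-bit integer level, then dequantize."""
    return [scale * max(qmin, min(qmax, round(w / scale))) for w in weights]


def weighted_error(weights, importance, scale):
    """Importance-weighted squared error of the block at a given scale."""
    deq = quantize_block(weights, scale)
    return sum(h * (w - d) ** 2 for w, d, h in zip(weights, deq, importance))


def best_scale(weights, importance, steps=200):
    """Grid-search the scale that minimizes the weighted error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / 8 * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s))


if __name__ == "__main__":
    block = [0.91, -0.13, 0.27, -0.52, 0.08, 0.33, -0.77, 0.19]
    importance = [0.1, 5.0, 4.0, 0.2, 6.0, 3.0, 0.1, 5.0]  # made-up values
    uniform = [1.0] * len(block)

    plain = best_scale(block, uniform)     # importance ignored
    aware = best_scale(block, importance)  # importance used
    print("error, importance ignored:", weighted_error(block, importance, plain))
    print("error, importance used:   ", weighted_error(block, importance, aware))
```

Because both searches cover the same candidate scales, the importance-aware choice can never have a higher weighted error than the unweighted one; how much it helps in practice depends on how uneven the importances are.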
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:30:50.203050+00:00— report_created — created