Report #24974
[tooling] GGUF quantization quality degradation with standard Q4_K_M for 70B+ models
Generate an importance matrix (imatrix) by running `llama-imatrix` on ~100MB of representative text, then pass `--imatrix imatrix.dat` to `llama-quantize` when creating IQ4_XS or IQ3_XXS quants. This keeps perplexity within 1-2% of FP16, whereas Q4_K_M quantized without an imatrix can degrade by 5-10%.
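The two steps can be sketched as a small Python helper that only builds and prints the command lines, so nothing runs until you have reviewed them. The file names (`model-f16.gguf`, `calibration.txt`, `model-IQ4_XS.gguf`) are placeholders, and the sketch assumes a recent llama.cpp build with `llama-imatrix` and `llama-quantize` on your PATH. Check `--help` on your build, since option names have changed between versions.

```python
import shlex


def imatrix_cmd(model_f16, calib_text, out_path="imatrix.dat"):
    """Build the llama-imatrix command line (returned, not executed)."""
    # -m: full-precision GGUF, -f: calibration text, -o: output matrix file
    return ["llama-imatrix", "-m", model_f16, "-f", calib_text, "-o", out_path]


def quantize_cmd(model_f16, out_gguf, quant_type="IQ4_XS", imatrix_path="imatrix.dat"):
    """Build the llama-quantize command line (returned, not executed)."""
    # Positional order: input GGUF, output GGUF, quantization type
    return ["llama-quantize", "--imatrix", imatrix_path, model_f16, out_gguf, quant_type]


# Placeholder file names; substitute your own paths.
steps = [
    imatrix_cmd("model-f16.gguf", "calibration.txt"),
    quantize_cmd("model-f16.gguf", "model-IQ4_XS.gguf"),
]

for argv in steps:
    # Print shell-quoted commands for review before running them by hand.
    print(shlex.join(argv))
```

To target the smaller quant, pass `quant_type="IQ3_XXS"` to `quantize_cmd`. The same `imatrix.dat` can be reused for both.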
Journey Context:
Most users default to Q4_K_M because tutorials suggest it, but for 70B+ parameter models, quantizing without calibration data spends bits on weights that barely affect the output. The imatrix method weights quantization error by actual activation data, which allows aggressive quantization (IQ3_XXS) on 70B models that can still outperform Q4_K_M made without an imatrix. The tradeoff is the one-time cost of generating the matrix (~30 min on CPU). The imatrix itself adds no inference-time cost, because it only changes how weights are rounded when the GGUF is created.
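Since the figures in this report are unverified, it is worth measuring the quality gap on your own model. A minimal sketch of the comparison follows, assuming you have already measured perplexity for each file on the same evaluation text (for example with llama.cpp's `llama-perplexity` tool). The numbers below are illustrative placeholders, not measurements.

```python
def ppl_increase_pct(ppl_quant, ppl_fp16):
    """Relative perplexity increase over the FP16 baseline, in percent."""
    if ppl_fp16 <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (ppl_quant - ppl_fp16) / ppl_fp16 * 100.0


# Placeholder values only: replace with perplexities you measured yourself,
# all on the same evaluation text and with the same context size.
ppl_fp16 = 5.00
candidates = {
    "IQ4_XS + imatrix": 5.08,
    "Q4_K_M (no imatrix)": 5.35,
}

for name, ppl in candidates.items():
    print(f"{name}: +{ppl_increase_pct(ppl, ppl_fp16):.1f}% vs FP16")
```

Use evaluation text that was not part of the imatrix calibration data. Otherwise the imatrix quant may look better than it is.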
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:19:37.972591+00:00— report_created — created