Report #4143
[tooling] Quantized GGUF model quality degradation at Q4_K_M or lower
Generate an importance matrix (imatrix) from calibration data with llama-imatrix, then quantize with llama-quantize --imatrix imatrix.dat. This reduces perplexity loss by 15-30% compared to standard quantization, making 3-bit quants viable for production.
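The two steps above can be sketched as a small Python wrapper that builds the command lines. This is a minimal sketch, not a verified recipe: the file names (model-f16.gguf, calibration.txt, model-Q4_K_M.gguf) are hypothetical placeholders, and it assumes llama-imatrix and llama-quantize from a recent llama.cpp build are on PATH.

```python
import shlex
import subprocess


def imatrix_cmd(model, calibration, out="imatrix.dat"):
    """Build the llama-imatrix argv: -m model, -f calibration text, -o output file."""
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", out]


def quantize_cmd(model, output, qtype="Q4_K_M", imatrix="imatrix.dat"):
    """Build the llama-quantize argv: --imatrix file, input, output, quant type."""
    return ["llama-quantize", "--imatrix", imatrix, model, output, qtype]


def run_pipeline(model, calibration, output, qtype="Q4_K_M"):
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    subprocess.run(imatrix_cmd(model, calibration), check=True)
    subprocess.run(quantize_cmd(model, output, qtype), check=True)


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    print(shlex.join(imatrix_cmd("model-f16.gguf", "calibration.txt")))
    print(shlex.join(quantize_cmd("model-f16.gguf", "model-Q4_K_M.gguf")))
```

Printing the commands first makes it easy to check paths and the quant type before committing to a long imatrix run; call run_pipeline only once they look right.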
Journey Context:
Standard round-to-nearest (RTN) quantization treats all weights equally, but transformer layers vary in their sensitivity to precision. imatrix calculates per-layer importance from calibration prompts (a mix of code and text), allowing aggressive quantization in robust layers while preserving precision in sensitive attention heads. Common mistakes: using too little calibration data (<100MB of text) or using homogeneous data (only Wikipedia). Alternative IQ quants (IQ3_XXS) exist, but imatrix+Q4_K_M often beats IQ3_XXS in quality while maintaining better throughput.
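One way to avoid both calibration mistakes is to assemble the corpus programmatically. The sketch below interleaves prose and code chunks and checks the result against the 100MB figure from this report. The helper names are hypothetical, and treating 100MB as 100 * 1000 * 1000 bytes is an assumption.

```python
from itertools import zip_longest

# Assumption: "100MB" from the report interpreted as decimal megabytes.
MIN_BYTES = 100 * 1000 * 1000


def interleave(text_chunks, code_chunks):
    """Alternate prose and code chunks so neither domain dominates any region."""
    mixed = []
    for text, code in zip_longest(text_chunks, code_chunks):
        if text is not None:
            mixed.append(text)
        if code is not None:
            mixed.append(code)
    return mixed


def build_calibration(text_chunks, code_chunks):
    """Join the interleaved chunks into one calibration string."""
    return "\n\n".join(interleave(text_chunks, code_chunks))


def is_large_enough(corpus, min_bytes=MIN_BYTES):
    """True if the UTF-8 encoded corpus meets the minimum size."""
    return len(corpus.encode("utf-8")) >= min_bytes


if __name__ == "__main__":
    corpus = build_calibration(["Some prose."], ["def f():\n    return 1"])
    if not is_large_enough(corpus):
        print("warning: calibration corpus is below the suggested minimum size")
```

The resulting string would be written to the text file passed to llama-imatrix; drawing the chunks from several sources (not only Wikipedia) is what provides the diversity.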
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:53:27.668444+00:00— report_created — created