Report #58599
[tooling] GGUF quantization degrades model quality too much at Q4_K_M or lower
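Generate an importance matrix (imatrix) from calibration data with the llama.cpp imatrix tool, then pass it to llama-quantize with --imatrix matrix.dat to produce IQ2_XXS or IQ3_XXS quants. With a good imatrix these are reported to approach, and on some models match, standard Q4_K_M quality at a much smaller file size; compare perplexity on your own model before switching.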
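The two steps can be sketched as a small Python helper that only builds and prints the command lines. This is a sketch under assumptions: the binaries are taken to be installed as llama-imatrix and llama-quantize (older llama.cpp builds shipped them as imatrix and quantize), and the file names model-f16.gguf and model-iq3_xxs.gguf are hypothetical placeholders. Check the flags against --help for your build.

```python
import shlex


def imatrix_cmd(model_f16, calibration_txt, imatrix_out):
    """Build the command that collects activation statistics into an imatrix file."""
    return ["llama-imatrix", "-m", model_f16, "-f", calibration_txt, "-o", imatrix_out]


def quantize_cmd(model_f16, imatrix_path, output_gguf, quant_type="IQ3_XXS"):
    """Build the command that quantizes the model using the importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix_path, model_f16, output_gguf, quant_type]


# Hypothetical input/output file names; matrix.dat and wiki.train.raw follow this report.
steps = [
    imatrix_cmd("model-f16.gguf", "wiki.train.raw", "matrix.dat"),
    quantize_cmd("model-f16.gguf", "matrix.dat", "model-iq3_xxs.gguf"),
]

# Print the commands for review instead of running them.
for step in steps:
    print(shlex.join(step))
```

To execute a step rather than print it, each list can be handed to subprocess.run(step, check=True) once the commands look right.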
Journey Context:
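Standard quantization treats all weights equally, which leaves high error in the weights the model is most sensitive to. The imatrix records activation statistics for each tensor while the model runs over calibration data (e.g., wiki.train.raw), so the quantizer can spend its limited precision where rounding error hurts the output most. Without it, the very low-bit IQ quants lose too much quality to be practical; with it, they become usable and much closer to the original model. Most users skip the calibration step and settle for Q4_K_M instead of trying an imatrix-based IQ3_XXS. The tradeoff is a few minutes of preprocessing, done once, for a file roughly 30-50% smaller; whether perplexity also improves depends on the model and should be measured.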
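The idea of weighting rounding error by importance can be illustrated with a toy example. This is a simplified sketch, not llama.cpp's actual quantization code: it picks one scale for a block of weights by grid search, once minimizing plain squared error and once minimizing importance-weighted squared error. The importance values, the block size of 32, and the ternary grid are all assumptions chosen for illustration.

```python
import random


def quantize_block(weights, scale, nmax):
    """Round each weight to a symmetric integer grid clamped to [-nmax, nmax], then dequantize."""
    return [max(-nmax, min(nmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequant, importance):
    """Sum of squared rounding errors, each multiplied by its importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequant, importance))


def best_scale(weights, importance, nmax, steps=200):
    """Grid-search the scale that minimizes the importance-weighted error."""
    amax = max(abs(w) for w in weights)
    if amax == 0:
        return 1.0
    base = amax / nmax
    best_err, best = None, base
    for k in range(1, steps + 1):
        scale = base * (0.5 + k / steps)  # candidates from about 0.5x to 1.5x of base
        err = weighted_error(weights, quantize_block(weights, scale, nmax), importance)
        if best_err is None or err < best_err:
            best_err, best = err, scale
    return best


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(32)]
# Hypothetical importance: a few input columns see much larger activations.
importance = [100.0 if i % 8 == 0 else 1.0 for i in range(32)]
uniform = [1.0] * 32
nmax = 1  # ternary grid {-1, 0, 1} times the scale

s_plain = best_scale(weights, uniform, nmax)
s_imat = best_scale(weights, importance, nmax)
err_plain = weighted_error(weights, quantize_block(weights, s_plain, nmax), importance)
err_imat = weighted_error(weights, quantize_block(weights, s_imat, nmax), importance)
print(f"importance-weighted error, plain scale:   {err_plain:.4f}")
print(f"importance-weighted error, imatrix scale: {err_imat:.4f}")
```

Because both scales come from the same candidate set, the importance-aware choice can never have a higher weighted error than the plain one; how large the gap is depends on how unevenly the importance is distributed.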
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:50:57.106156+00:00— report_created — created