Report #17305
[tooling] Q4\_K\_M quantized models show high perplexity degradation on important tokens
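Generate an importance matrix \(imatrix\) by running \`./llama-imatrix\` on a representative text sample \(commonly a few megabytes rather than gigabytes\), then pass \`--imatrix imatrix.dat\` to \`./llama-quantize\` when quantizing the GGUF model. The imatrix weights the quantization error toward the weights that matter most on the calibration data, which can bring Q4\_K\_M closer to Q5\_K\_M quality without increasing file size.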
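The two steps above can be sketched as a small Python helper that only assembles the command lines. The flags \`-m\`, \`-f\`, \`-o\` and \`--imatrix\` are the ones these tools are generally known to accept, but the file names \(\`model-f16.gguf\`, \`calibration.txt\`, \`model-Q4\_K\_M.gguf\`\) and the helper \`imatrix_commands\` are hypothetical placeholders; check \`--help\` for your build before running anything.

```python
import shlex


def imatrix_commands(model_f16, calibration_txt,
                     imatrix_path="imatrix.dat",
                     output_gguf="model-Q4_K_M.gguf",
                     quant_type="Q4_K_M"):
    """Build (but do not run) the two command lines for imatrix quantization.

    All paths are illustrative placeholders, not names from the report.
    """
    # Step 1: collect activation statistics on the calibration text.
    collect = ["./llama-imatrix", "-m", model_f16,
               "-f", calibration_txt, "-o", imatrix_path]
    # Step 2: quantize the full-precision GGUF using those statistics.
    quantize = ["./llama-quantize", "--imatrix", imatrix_path,
                model_f16, output_gguf, quant_type]
    return collect, quantize


collect_cmd, quantize_cmd = imatrix_commands("model-f16.gguf", "calibration.txt")
for cmd in (collect_cmd, quantize_cmd):
    # Print the commands for review; run them manually once verified.
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; once the commands look right for your setup, they can be pasted into a shell or passed to \`subprocess.run\`.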
Journey Context:
Standard quantization minimizes rounding error uniformly across all weights, but transformer outputs are far more sensitive to some weights than others \(typically those multiplied by large, outlier activations\). Imatrix calibration estimates the importance of each input channel \(column\) of the weight matrices by accumulating squared activation magnitudes during inference on calibration data. The quantizer then uses these values to weight its rounding error, choosing scales that preserve the important weights more accurately; the bit budget and file size do not change. Without this, Q4\_K\_M can degrade noticeably on code or reasoning tasks. Users often skip this step because it requires obtaining calibration data \(such as RedPajama slices\) and running the imatrix binary, which takes ~30 minutes. The result is a GGUF file of the same size with substantially higher quality, in favorable cases close to the FP16 base. The key insight is that the imatrix is data-dependent: calibration data similar to your target domain yields better results than generic datasets.
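To make the idea concrete, here is a toy Python sketch of importance-weighted quantization. It is a simplified illustration, not the actual K-quant code: the block size of 32, the symmetric 4-bit range, the brute-force scale search, and all function names are assumptions made for the example. It derives per-channel importance from synthetic activations, then compares a scale chosen with uniform weighting against one chosen with the importance weighting.

```python
import random


def importance_from_activations(activations):
    """Mean squared activation per input channel (toy stand-in for an imatrix)."""
    n = len(activations)
    dim = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) / n for j in range(dim)]


def quantize(weights, scale, qmin=-8, qmax=7):
    """Round each weight to the nearest 4-bit level, then dequantize."""
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequant, importance):
    """Squared rounding error, weighted by per-channel importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequant, importance))


def best_scale(weights, importance, n_candidates=64):
    """Brute-force search for the scale with the lowest weighted error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / 8 * (0.5 + k / n_candidates)
                  for k in range(1, n_candidates + 1)]
    return min(candidates,
               key=lambda s: weighted_error(weights, quantize(weights, s), importance))


random.seed(0)
dim = 32  # one block of weights (assumed block size)
weights = [random.gauss(0, 1) for _ in range(dim)]
# Synthetic calibration data: the first four channels carry outlier activations.
activations = [[random.gauss(0, 10 if j < 4 else 1) for j in range(dim)]
               for _ in range(256)]
importance = importance_from_activations(activations)

s_plain = best_scale(weights, [1.0] * dim)   # every weight treated equally
s_imatrix = best_scale(weights, importance)  # error weighted by importance

err_plain = weighted_error(weights, quantize(weights, s_plain), importance)
err_imatrix = weighted_error(weights, quantize(weights, s_imatrix), importance)
print("importance-weighted error, plain scale:  ", err_plain)
print("importance-weighted error, imatrix scale:", err_imatrix)
```

Both variants store the same number of bits per weight; only the choice of scale differs. Because both searches draw from the same candidate scales, the importance-weighted choice can never do worse than the plain one on the importance-weighted error.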
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:56:46.111964+00:00— report_created — created