Report #45893
[tooling] GGUF Q4\_K\_M quantization produces degraded output quality compared to the original model
Generate an importance matrix \(imatrix\) from calibration data before quantizing. Run: ./llama-imatrix -m unquantized.gguf -f calibration.txt -o imatrix.dat --gpu-layers 99, then quantize with: ./llama-quantize --imatrix imatrix.dat unquantized.gguf output.gguf Q4\_K\_M \(llama-quantize expects the output file before the quantization type\).
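The two steps can be wrapped in a small script. This is a minimal sketch, not a tested recipe: it assumes the llama.cpp binaries sit in the current directory, reuses the file names from this report, and introduces a hypothetical DRY_RUN switch and run helper (neither is part of llama.cpp) so the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# File names follow the report; override them via the environment if needed.
MODEL="${MODEL:-unquantized.gguf}"
CALIB="${CALIB:-calibration.txt}"
IMATRIX="${IMATRIX:-imatrix.dat}"
OUTPUT="${OUTPUT:-output.gguf}"

# Hypothetical safety switch: by default the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
  printf '%s\n' "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

quantize_with_imatrix() {
  # Step 1: collect importance statistics from the unquantized model.
  run ./llama-imatrix -m "$MODEL" -f "$CALIB" -o "$IMATRIX" --gpu-layers 99
  # Step 2: quantize using those statistics (output file, then type).
  run ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" Q4_K_M
}

quantize_with_imatrix
```

Run it once as-is to review the printed commands, then rerun with DRY_RUN=0 to execute them.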
Journey Context:
Standard GGUF quantization treats all weights as equally important when rounding, so error accumulates in 'salient' weight channels that disproportionately affect model output. As a result, Q4\_K\_M can sometimes hallucinate or lose instruction-following capability compared to Q5\_K\_M or the original model. The importance matrix \(imatrix\) is computed by passing calibration data through the unquantized model and recording activation statistics that indicate which weights, if perturbed, would most affect the output. The quantizer then weights rounding error by this importance, so the sensitive weights are preserved more accurately at the same bit width. The tradeoff is a one-time upfront cost of generating the imatrix \(30-60 minutes on a large model\), but the resulting Q4\_K\_M often outperforms non-imatrix Q5\_K\_M while retaining the smaller size.
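As a rough sketch of the weighting idea, assuming a weighted least-squares view of quantization: the symbols below are introduced here for illustration only, and the exact weighting used inside llama.cpp varies by quantization type.

```latex
\min_{q}\; \sum_{j} w_j \,\bigl(x_j - q_j\bigr)^{2},
\qquad
w_j \approx \frac{1}{T} \sum_{t=1}^{T} a_{t,j}^{2}
```

Here x_j are the original weights in a block, q_j their dequantized values, a_{t,j} the activation feeding input channel j at calibration token t, and T the number of calibration tokens. Setting every w_j = 1 recovers plain unweighted rounding, which is the case where all weights count equally.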
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:30:33.656312+00:00— report_created — created