Report #44130
[tooling] Q4\_K\_M quantized models fail at coding or reasoning tasks compared to unquantized
Generate an importance matrix \(imatrix\) from calibration data and pass it to the quantizer. First run \`./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat --threads 16\` \(use code-heavy calibration text for code models\), then quantize with \`./llama-quantize --imatrix imatrix.dat model-f16.gguf output.gguf Q4\_K\_M\`. The imatrix steers quantization error toward less important weights; by this report's estimate, that recovers about 80% of Q8\_0 quality at Q4\_K\_M size.
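The two commands can be wrapped in a small script so they always run in the right order. This is a minimal sketch, not part of llama.cpp: the helper names (`imatrix_cmd`, `quantize_cmd`, `run_pipeline`) and the `bin_dir` parameter are hypothetical, and only the flags quoted in this report are used. It defaults to a dry run that prints the commands without executing them.

```python
import os
import subprocess


def imatrix_cmd(model_f16, calibration, out="imatrix.dat", threads=16, bin_dir="."):
    """Build the llama-imatrix command line quoted in this report."""
    return [
        os.path.join(bin_dir, "llama-imatrix"),
        "-m", str(model_f16),        # FP16/BF16 source model, not a quantized one
        "-f", str(calibration),      # raw-text calibration file
        "-o", str(out),
        "--threads", str(threads),
    ]


def quantize_cmd(model_f16, output, imatrix="imatrix.dat", qtype="Q4_K_M", bin_dir="."):
    """Build the llama-quantize command line quoted in this report."""
    return [
        os.path.join(bin_dir, "llama-quantize"),
        "--imatrix", str(imatrix),
        str(model_f16), str(output), qtype,
    ]


def run_pipeline(model_f16, calibration, output, dry_run=True):
    """Print both steps in order; execute them only when dry_run is False."""
    cmds = [imatrix_cmd(model_f16, calibration), quantize_cmd(model_f16, output)]
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline if the imatrix step fails
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    run_pipeline("model-f16.gguf", "calibration.txt", "output.gguf")
```

Check the printed commands first, then pass `dry_run=False` to run them.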
Journey Context:
Standard quantization treats all weights within a tensor equally. In transformers, however, some tensors \(the final lm\_head, specific MLP gates, certain attention projections\) and some weights within them are far more sensitive to precision loss. The imatrix \(importance matrix\) is computed by running calibration data through the FP16 model and accumulating squared activation magnitudes for each tensor, an approximation of the Hessian diagonal that shows which weights contribute most to the layer output. During quantization, the quantizer uses these values to weight the rounding error, so the most important weights are reproduced most accurately within the same bit budget. Common mistakes: using random or irrelevant calibration data \(it should resemble the target domain, e.g., code for coding models\), and generating the imatrix from an already quantized model instead of the FP16/BF16 source. Without an imatrix, Q4\_K\_M on 70B models often collapses on complex reasoning; with one, it reportedly matches Q5\_K\_M or better. The calibration file should be raw text \(not JSON\), ~100-1000MB of representative data.
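To illustrate why activation statistics help, here is a toy sketch of importance-weighted quantization for a single weight row. It is not llama.cpp's implementation: the function names, the symmetric 4-bit scheme, and the scale grid are all assumptions made for the example. The idea is that for an output y = w·x, a weight error e\_j contributes roughly e\_j² · E[x\_j²] to the output error (assuming uncorrelated activations), so the mean squared activation of each input column serves as its importance.

```python
import random


def importance_from_activations(activations):
    """Mean squared activation per input column (imatrix-style statistic)."""
    n_cols = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) / len(activations)
            for j in range(n_cols)]


def weighted_error(weights, deq, importance):
    """Importance-weighted squared error between original and dequantized weights."""
    return sum(i * (w - d) ** 2 for i, w, d in zip(importance, weights, deq))


def quantize_row(weights, importance=None, bits=4, n_candidates=64):
    """Symmetric round-to-nearest quantization with a grid search over the scale.

    Keeps the candidate scale that minimizes the importance-weighted error;
    importance=None means uniform weighting (plain squared error).
    """
    if importance is None:
        importance = [1.0] * len(weights)
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)                    # nothing to quantize
    best_err, best_deq = float("inf"), None
    for k in range(1, n_candidates + 1):
        # Candidate scales just above 0.5x up to 1.5x of the naive max-abs scale
        scale = max_abs / qmax * (0.5 + k / n_candidates)
        deq = [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]
        err = weighted_error(weights, deq, importance)
        if err < best_err:
            best_err, best_deq = err, deq
    return best_deq


# Synthetic data: the first 4 input columns carry much larger activations.
rng = random.Random(0)
n_cols = 32
col_gain = [10.0 if j < 4 else 1.0 for j in range(n_cols)]
activations = [[rng.gauss(0, 1) * col_gain[j] for j in range(n_cols)]
               for _ in range(256)]
weights = [rng.gauss(0, 1) for _ in range(n_cols)]

imp = importance_from_activations(activations)
plain = quantize_row(weights)          # ignores activations
guided = quantize_row(weights, imp)    # uses the importance vector

print("weighted error, plain :", weighted_error(weights, plain, imp))
print("weighted error, guided:", weighted_error(weights, guided, imp))
```

Both runs search the same scale candidates, so the guided result can never have a higher importance-weighted error than the plain one. The size of the gap depends on how uneven the activation statistics are, which is also why calibration data from the wrong domain yields a misleading importance vector.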
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:32:35.606865+00:00— report_created — created