Report #99756
[tooling] GGUF Q4_K_M quantization degrades model quality more than expected
Generate an importance matrix (imatrix) on domain-representative text, then pass it to the quantizer:
./llama-imatrix -m model-f16.gguf -f train-data.txt -ngl 99 -o imatrix.dat && ./llama-quantize --imatrix imatrix.dat model-f16.gguf model-q4_k_m.gguf q4_k_m
Use 1-10 GB of text that resembles your target workload; do not use random or tiny samples.
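The two steps above can also be kept in a small script. This is a minimal sketch, not a verified recipe: the file names are the placeholders from this report, and DRY_RUN and run() are hypothetical helpers introduced here (they are not llama.cpp features). By default the script only prints the commands so they can be checked before anything runs.

```shell
#!/bin/sh
# Sketch of the two-step imatrix workflow from this report.
# Assumptions: the llama.cpp binaries are in the current directory and the
# file names below are placeholders. DRY_RUN and run() are hypothetical helpers.
set -eu

MODEL_F16="${MODEL_F16:-model-f16.gguf}"     # full-precision source model
CALIB_TEXT="${CALIB_TEXT:-train-data.txt}"   # domain-representative calibration text
IMATRIX="${IMATRIX:-imatrix.dat}"            # importance matrix written by step 1
OUT_GGUF="${OUT_GGUF:-model-q4_k_m.gguf}"    # quantized output of step 2
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands only, 0 = execute

run() {
  # Always print the command; execute it only when DRY_RUN=0.
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Step 1: collect activation statistics on the calibration text.
run ./llama-imatrix -m "$MODEL_F16" -f "$CALIB_TEXT" -ngl 99 -o "$IMATRIX"

# Step 2: quantize to Q4_K_M, guided by the importance matrix.
run ./llama-quantize --imatrix "$IMATRIX" "$MODEL_F16" "$OUT_GGUF" q4_k_m
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 to execute them. With set -e, a failed imatrix step stops the script before quantization, matching the && in the one-line form.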
Journey Context:
Without an importance matrix, GGUF quantization minimizes rounding error with no knowledge of which weights matter for real inputs, yet some layers and channels are far more sensitive to rounding than others. llama-imatrix collects activation statistics on calibration data and tells the quantizer where to spend bits, often recovering much of the gap between naive Q4 and Q5/Q6. A subtle trap: quantizing output.weight with the imatrix usually gives worse results, which is why --process-output defaults to false. Also, if the calibration data does not match the target domain, the matrix can overfit to it and hurt rather than help.
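One way to check whether the imatrix helped or hurt is to compare perplexity on held-out domain text for a plain Q4_K_M and an imatrix-guided one. The sketch below assumes llama-perplexity from the same llama.cpp build. heldout.txt, the output file names, and the DRY_RUN/run helpers are hypothetical names chosen for illustration, and the held-out file should not overlap with the calibration text.

```shell
#!/bin/sh
# Sketch: compare a plain and an imatrix-guided Q4_K_M on held-out text.
# Assumptions: llama.cpp binaries are in the current directory; heldout.txt,
# the output names, DRY_RUN and run() are hypothetical, not from the report.
set -eu

MODEL_F16="${MODEL_F16:-model-f16.gguf}"
IMATRIX="${IMATRIX:-imatrix.dat}"
HELDOUT="${HELDOUT:-heldout.txt}"   # domain text NOT used for calibration
PLAIN="model-q4_k_m-plain.gguf"     # baseline without imatrix
IMAT="model-q4_k_m-imat.gguf"       # candidate with imatrix
DRY_RUN="${DRY_RUN:-1}"             # 1 = print commands only, 0 = execute

run() {
  # Always print the command; execute it only when DRY_RUN=0.
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Baseline: same quantization type, no importance matrix.
run ./llama-quantize "$MODEL_F16" "$PLAIN" q4_k_m

# Candidate: same type, guided by the importance matrix.
run ./llama-quantize --imatrix "$IMATRIX" "$MODEL_F16" "$IMAT" q4_k_m

# Measure perplexity of both on the same held-out text (lower is better).
for model in "$PLAIN" "$IMAT"; do
  run ./llama-perplexity -m "$model" -f "$HELDOUT" -ngl 99
done
```

If the imatrix-guided model scores noticeably worse than the baseline on held-out text, a domain mismatch in the calibration data is a likely cause.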
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:00:50.197352+00:00 - report_created - created