Report #605
[tooling] How do I quantize a model to Q4\_K\_M or IQ quants without severe quality loss?
Generate an importance matrix with \`llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix\`, then pass it to the quantizer with \`llama-quantize --imatrix model.imatrix model-f16.gguf output.gguf Q4\_K\_M\`. Calibrate on text that is representative of your target domain, and always quantize from the F16/BF16 source model, never from a GGUF that has already been quantized.
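The two steps can be wrapped in a small script, sketched below. This is only an illustration: the file names are the placeholders from the commands above, and the variable names (SRC, CALIB, IMATRIX, OUT, QTYPE) and the DRY\_RUN switch are assumptions of this sketch, not part of llama.cpp. By default it only prints the commands so they can be checked before anything runs.

```shell
#!/bin/sh
# Two-step imatrix quantization (sketch). File names are placeholders.
set -eu

SRC="${SRC:-model-f16.gguf}"        # F16/BF16 source, not a quantized GGUF
CALIB="${CALIB:-calibration.txt}"   # domain-representative calibration text
IMATRIX="${IMATRIX:-model.imatrix}" # importance matrix written by step 1
OUT="${OUT:-output.gguf}"           # quantized result
QTYPE="${QTYPE:-Q4_K_M}"            # target quant type
DRY_RUN="${DRY_RUN:-1}"             # assumption: print only unless set to 0

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: collect importance statistics from the calibration text.
run llama-imatrix -m "$SRC" -f "$CALIB" -o "$IMATRIX"
# Step 2: quantize the same F16/BF16 source using that matrix.
run llama-quantize --imatrix "$IMATRIX" "$SRC" "$OUT" "$QTYPE"
```

Saved under a hypothetical name such as quantize.sh, running it with no arguments prints both commands; `DRY_RUN=0 QTYPE=IQ4_XS sh quantize.sh` would execute them for a different quant type.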
Journey Context:
Without an importance matrix, the quantizer has no information about which weights matter most, so precision is reduced without regard to its effect on the output, and the degradation becomes visible below roughly Q6. The importance matrix is built by running the calibration text through the model and recording which weights most affect its output, which lets the quantizer spend bits where quantization loss hurts most. It matters most for IQ4\_XS, the IQ2 family, and Q4\_K\_M. Two common mistakes: quantizing from an already-quantized GGUF instead of the F16 source, and using generic calibration text for a specialized domain. AWQ scales are an alternative, but imatrix is simpler to apply and usually gives better results.
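The common mistakes described above can be partly caught with a preflight check before quantizing. The sketch below is a rough heuristic only: looks\_quantized and preflight are hypothetical helper names, and the check inspects the file name for a quant tag rather than reading GGUF metadata, so a pass is not proof that the source is really F16/BF16.

```shell
#!/bin/sh
# Heuristic preflight before quantizing (sketch). Not part of llama.cpp.

# Return 0 if the file name carries a quant tag such as Q4_K_M, Q8_0, or IQ4_XS.
# Assumption: the file name reflects the actual tensor type.
looks_quantized() {
  case "$1" in
    *[Qq][0-9]_*) return 0 ;;
    *) return 1 ;;
  esac
}

# Warn about each likely problem; return 1 if any was found.
preflight() {
  src=$1
  calib=$2
  status=0
  if looks_quantized "$src"; then
    echo "warning: $src looks already quantized; start from F16/BF16" >&2
    status=1
  fi
  if [ ! -s "$calib" ]; then
    echo "warning: calibration file $calib is missing or empty" >&2
    status=1
  fi
  return "$status"
}
```

For example, `preflight model-f16.gguf calibration.txt && echo "ok to quantize"`. The check cannot judge whether the calibration text is representative of the target domain; that still has to be reviewed by hand.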
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T10:52:29.865090+00:00— report_created — created