Report #928
[tooling] IQ2/IQ3 GGUF quants produce garbled or low-quality output
Generate an importance matrix first: ./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix, then pass it to the quantizer: ./llama-quantize --imatrix model.imatrix model-f16.gguf out.gguf IQ4_XS. Use an imatrix for any quant below Q6; it is essential for IQ2_XXS, IQ2_XS, and IQ3_XXS. Compute the imatrix from the F16/BF16 model, or at least from Q8_0, not from a file that has already been requantized to a lower bit width.
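The two steps can be wrapped in a small script so the imatrix is always produced before quantization. This is a minimal sketch, not part of llama.cpp: the helper names (build_commands, run_pipeline), the bin_dir and dry_run parameters, and the default paths are hypothetical. The flags are only those that appear in the commands in this report.

```python
import os
import shlex
import subprocess


def build_commands(f16_model, calibration, out_model, quant_type,
                   imatrix_path="model.imatrix", bin_dir="."):
    """Return two argv lists: imatrix generation, then quantization.

    bin_dir is an assumed location of the llama.cpp binaries.
    """
    imatrix_cmd = [
        os.path.join(bin_dir, "llama-imatrix"),
        "-m", f16_model,       # high-precision source model (F16/BF16)
        "-f", calibration,     # domain-relevant calibration text
        "-o", imatrix_path,    # where to write the importance matrix
    ]
    quantize_cmd = [
        os.path.join(bin_dir, "llama-quantize"),
        "--imatrix", imatrix_path,
        f16_model, out_model, quant_type,
    ]
    return imatrix_cmd, quantize_cmd


def run_pipeline(*args, dry_run=True, **kwargs):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(*args, **kwargs):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop if either step fails


if __name__ == "__main__":
    # Dry run by default, so nothing is executed until the commands are reviewed.
    run_pipeline("model-f16.gguf", "calibration.txt", "out.gguf", "IQ4_XS")
```

Keeping dry_run on by default makes it easy to inspect the exact command lines before committing GPU time to the imatrix pass.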
Journey Context:
IQ (importance-aware) quants allocate precision using activation statistics. Without an imatrix, the quantizer treats all weights equally and low-bit IQ formats collapse. An imatrix is just a diagonal activation-importance file computed by running the model over a domain-relevant corpus; it costs minutes on GPU and dramatically improves low-bit results. The default 'do not use imatrix for output.weight' is usually correct; omit --process-output unless you have a specific reason.
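To make the idea of importance-weighted quantization concrete, here is a toy sketch. It is not llama.cpp's actual algorithm: the three-level grid, the candidate scales, and the importance values are all invented for illustration. It picks a quantization scale for one row of weights twice, once treating every weight equally and once weighting the squared error by a per-column importance (standing in for the diagonal statistics an imatrix stores), then compares the importance-weighted error of both choices.

```python
def quantize_row(weights, scale, levels=(-1, 0, 1)):
    """Round each weight to the nearest value of scale * level."""
    return [scale * min(levels, key=lambda q: abs(w - scale * q))
            for w in weights]


def weighted_error(weights, approx, importance):
    """Sum of squared errors, each weighted by its column importance."""
    return sum(i * (w - a) ** 2
               for w, a, i in zip(weights, approx, importance))


def best_scale(weights, importance, candidates):
    """Pick the candidate scale with the lowest weighted error."""
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_row(weights, s), importance),
    )


weights = [0.9, -0.1, 0.4, -0.5]
# Hypothetical importances: columns 3 and 4 see large activations.
importance = [0.01, 0.02, 4.0, 3.0]
uniform = [1.0] * len(weights)          # "no imatrix": all weights equal
candidates = [0.1 * k for k in range(1, 11)]

s_plain = best_scale(weights, uniform, candidates)
s_imat = best_scale(weights, importance, candidates)

# Judge both choices by the error that matters at inference time,
# i.e. the importance-weighted one.
err_plain = weighted_error(weights, quantize_row(weights, s_plain), importance)
err_imat = weighted_error(weights, quantize_row(weights, s_imat), importance)

print(f"scale without importance: {s_plain:.1f}, weighted error {err_plain:.4f}")
print(f"scale with importance:    {s_imat:.1f}, weighted error {err_imat:.4f}")
```

Because the importance-aware scale is chosen to minimize the weighted error directly, its weighted error can never exceed that of the equal-weight choice over the same candidates. With only a handful of levels available, as in low-bit formats, that gap tends to matter more.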
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:58:31.647099+00:00— report_created — created