Report #98832
[tooling] IQ2/IQ3/IQ4 GGUF quantization in llama.cpp produces garbled output
Generate an importance matrix from a calibration corpus, then pass `--imatrix model.imatrix` to `llama-quantize`. Example: `./llama-imatrix -m model-f16.gguf -f domain-calibration.txt -o model.imatrix -ngl 99`, then `./llama-quantize --imatrix model.imatrix model-f16.gguf output.gguf IQ3_XS`.
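The two steps can be wrapped in a small script. This is a minimal sketch, assuming the llama.cpp binaries are in the current directory. `LLAMA_BIN`, `DRY_RUN`, `run` and `quantize_with_imatrix` are hypothetical names for this illustration, not part of llama.cpp, and the imatrix file name is derived from the output name purely for convenience.

```shell
#!/bin/sh
# Sketch: build an importance matrix, then quantize with it.
# LLAMA_BIN and DRY_RUN are hypothetical helper variables for this script.
LLAMA_BIN="${LLAMA_BIN:-.}"
DRY_RUN="${DRY_RUN:-0}"

# Print the command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Usage: quantize_with_imatrix SRC_GGUF CALIB_TXT OUT_GGUF QUANT_TYPE
quantize_with_imatrix() {
    src="$1"; calib="$2"; out="$3"; qtype="$4"
    imatrix="${out%.gguf}.imatrix"

    # Step 1: collect activation statistics over the calibration text.
    run "$LLAMA_BIN/llama-imatrix" -m "$src" -f "$calib" -o "$imatrix" -ngl 99 || return 1

    # Step 2: quantize the high-precision source model, guided by the imatrix.
    run "$LLAMA_BIN/llama-quantize" --imatrix "$imatrix" "$src" "$out" "$qtype"
}
```

After sourcing the script, setting `DRY_RUN=1` and calling `quantize_with_imatrix model-f16.gguf domain-calibration.txt output.gguf IQ3_XS` prints both commands without running them, which makes it easy to check the arguments before committing to a long imatrix run.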
Journey Context:
IQ quants rely on an importance matrix (imatrix) to decide where quantization error matters most, based on how strongly each weight affects the model's output on calibration text. Without an imatrix the quantizer treats all weights as equally important, which can be catastrophic at the low bit widths these types target. This is why many downloaded IQ quants list an imatrix in their metadata (`quantize.imatrix.*` keys). Generic wiki calibration works, but a few hundred KB of target-domain text usually gives better results. When re-quantizing to a lower IQ tier, compute the imatrix from a model that is higher fidelity than the target (F16 or Q8_0 is safest), not from an already-quantized model at or below the target's precision.
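To check whether a given GGUF file was quantized with an imatrix, one option is to dump its metadata and look for those keys. This sketch assumes the `gguf-dump` tool from the `gguf` Python package is installed (`pip install gguf`). `has_imatrix_keys` and `check_gguf` are hypothetical helper names, and the exact dump formatting may differ between versions.

```shell
# Succeeds if the metadata text on stdin mentions any quantize.imatrix.* key.
has_imatrix_keys() {
    grep -q 'quantize\.imatrix\.'
}

# Usage: check_gguf FILE.gguf
# Dumps key/value metadata only (no tensor listing) and reports the result.
check_gguf() {
    if gguf-dump --no-tensors "$1" | has_imatrix_keys; then
        echo "imatrix metadata found in $1"
    else
        echo "no imatrix metadata in $1"
    fi
}
```

A file with no such keys was not necessarily quantized without an imatrix, since the metadata is only present when the quantizing tool wrote it, so treat a negative result as a hint to investigate.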
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:51:14.703752+00:00— report_created — created