Report #88475
[tooling] Poor-quality 3-bit or 4-bit GGUF quantization despite using Q4_K_M
Generate an importance matrix first: `./llama-imatrix -m unquantized.gguf -f calibration.txt -o model.imatrix`, then quantize with `./llama-quantize --imatrix model.imatrix unquantized.gguf Q4_K_M` (essential for IQ3_XXS or Q4_K_S).
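The two commands above can be chained in a small helper, sketched here. The file names `unquantized.gguf`, `calibration.txt`, and `model.imatrix` come from this report. The `LLAMA_BIN` and `DRY_RUN` variables, the `run` helper, and the explicit output name `model-<type>.gguf` are assumptions added for illustration (`llama-quantize` accepts an optional output path before the type).

```shell
# Sketch only: LLAMA_BIN, DRY_RUN, run() and the output file name are
# illustrative assumptions, not part of the original report.
LLAMA_BIN="${LLAMA_BIN:-.}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: quantize_with_imatrix MODEL CALIBRATION_TEXT QUANT_TYPE [IMATRIX_FILE]
quantize_with_imatrix() {
    model="$1"
    calib="$2"
    qtype="$3"
    imatrix="${4:-model.imatrix}"
    out="model-$qtype.gguf"   # hypothetical output name

    # Step 1: collect importance statistics from the unquantized model.
    run "$LLAMA_BIN/llama-imatrix" -m "$model" -f "$calib" -o "$imatrix" &&
    # Step 2: quantize, passing the importance matrix to the quantizer.
    run "$LLAMA_BIN/llama-quantize" --imatrix "$imatrix" "$model" "$out" "$qtype"
}
```

Calling `DRY_RUN=1 quantize_with_imatrix unquantized.gguf calibration.txt Q4_K_M` prints both commands without running them, which is a cheap way to check paths before committing to the long imatrix pass.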
Journey Context:
Standard quantization treats all weights equally, which leads to high perplexity at 3-bit or aggressive 4-bit levels. The imatrix (importance matrix) is generated by running calibration data (Wikitext-2 or a domain-specific corpus) through the unquantized model to identify salient weights. Quantization then uses these importance scores to keep rounding error low on the weights that matter most, rather than treating every weight the same. Without an imatrix, IQ3_XXS is unusable; with one, it rivals Q4_K_M quality. Many users skip this step because it requires the unquantized model and extra processing time (30-60 mins), but it is mandatory for high-quality small quants.
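One way to check the perplexity claim on your own model is to run `llama-perplexity` on a quant made with the imatrix and one made without it. The sketch below assumes both quantized files already exist. The file names in the usage note, the `LLAMA_BIN` and `DRY_RUN` variables, and the `run` helper are hypothetical. Using evaluation text that differs from the calibration text is an assumption about good practice, not something this report states.

```shell
# Sketch only: LLAMA_BIN, DRY_RUN and run() are illustrative assumptions.
LLAMA_BIN="${LLAMA_BIN:-.}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: compare_perplexity EVAL_TEXT MODEL [MODEL...]
# Runs llama-perplexity once per model on the same evaluation text.
compare_perplexity() {
    eval_text="$1"
    shift
    for model in "$@"; do
        echo "== $model =="
        run "$LLAMA_BIN/llama-perplexity" -m "$model" -f "$eval_text" || return 1
    done
}
```

For example, `compare_perplexity eval.txt model-noimatrix.gguf model-imatrix.gguf` runs both models back to back so the reported perplexities can be read side by side; lower is better.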
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:05:17.301162+00:00— report_created — created