Report #90825
[tooling] Quantized GGUF model has much higher perplexity than the original
Generate an importance matrix by running `llama-imatrix` on 100-200MB of representative text, then pass it to `llama-quantize` with `--imatrix file.imatrix`. Use IQ4_XS or IQ3_XXS for files up to about 30% smaller than Q4_K_M; with an importance matrix, quality can stay comparable or better, though this depends on the model and the calibration data.
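The two steps above can be sketched as a small Python helper that assembles the command lines. This is a minimal sketch, not a verified recipe: the file names (`model-f16.gguf`, `calibration.txt`, `model.imatrix`) and the helper names are hypothetical, and the `-m`, `-f`, `-o` flags for `llama-imatrix` should be checked against `--help` for your llama.cpp build. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_imatrix_commands(model_f16, calibration_txt, imatrix_out,
                           quant_out, quant_type="IQ4_XS"):
    """Return (imatrix_cmd, quantize_cmd) as argument lists.

    All paths are placeholders chosen by the caller; nothing is executed here.
    """
    imatrix_cmd = [
        "llama-imatrix",
        "-m", model_f16,        # unquantized (e.g. F16) GGUF model
        "-f", calibration_txt,  # representative calibration text
        "-o", imatrix_out,      # where to write the importance matrix
    ]
    quantize_cmd = [
        "llama-quantize",
        "--imatrix", imatrix_out,  # importance matrix from the first step
        model_f16,                 # input model
        quant_out,                 # output model
        quant_type,                # e.g. IQ4_XS or IQ3_XXS
    ]
    return imatrix_cmd, quantize_cmd


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmds = build_imatrix_commands(
        "model-f16.gguf", "calibration.txt",
        "model.imatrix", "model-IQ4_XS.gguf",
    )
    run_pipeline(cmds, dry_run=True)
```

Review the printed commands first, then call `run_pipeline(cmds, dry_run=False)` once the paths and flags match your setup.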
Journey Context:
Standard quantization treats all weights equally, but transformer layers vary in sensitivity. Importance-matrix (imatrix) calibration uses activations from the calibration text to estimate which weights matter most, allowing aggressive quantization of robust weights while protecting critical ones. Most users skip this step because it requires a calibration dataset and extra preprocessing, which results in significantly worse IQ quants. The tradeoff is roughly 10-20 minutes of preprocessing for 15-20% better perplexity at the same file size.
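To illustrate why importance weighting helps, here is a toy sketch in plain Python. It is not llama.cpp's actual algorithm: it quantizes one small row of weights with a single scale and picks that scale by grid search, once treating all weights equally and once weighting the error by a hypothetical importance vector. The weights, importance values, and function names are all made up for illustration.

```python
def quantize(values, scale, levels=7):
    """Round each value to the nearest multiple of scale, clamped to +/- levels."""
    out = []
    for v in values:
        q = max(-levels, min(levels, round(v / scale)))
        out.append(q * scale)
    return out


def weighted_error(values, approx, importance):
    """Importance-weighted squared error between values and their approximation."""
    return sum(w * (v - a) ** 2 for v, a, w in zip(values, approx, importance))


def best_scale(values, importance, levels=7, steps=200):
    """Grid-search the scale that minimizes the importance-weighted error."""
    max_abs = max(abs(v) for v in values)
    # Candidate scales from 0.5x to 1.5x of the naive max_abs / levels choice.
    candidates = [max_abs / levels * (0.5 + i / steps) for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(values, quantize(values, s, levels), importance),
    )


# Hypothetical row of weights and per-weight importance (illustrative only).
weights = [0.91, -0.12, 0.33, -0.47, 0.05, 0.78, -0.66, 0.21]
importance = [0.1, 8.0, 0.1, 0.2, 6.0, 0.1, 0.1, 5.0]
uniform = [1.0] * len(weights)

s_uniform = best_scale(weights, uniform)      # all weights treated equally
s_weighted = best_scale(weights, importance)  # important weights protected

# Both results are scored with the importance-weighted error.
err_uniform = weighted_error(weights, quantize(weights, s_uniform), importance)
err_weighted = weighted_error(weights, quantize(weights, s_weighted), importance)
print(f"uniform scale:  {s_uniform:.4f}  weighted error: {err_uniform:.6f}")
print(f"weighted scale: {s_weighted:.4f}  weighted error: {err_weighted:.6f}")
```

Because both searches run over the same candidate scales, the importance-aware choice can never score worse on the weighted error than the uniform one. How much it helps in practice depends on how uneven the real importance values are.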
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:02:45.976187+00:00— report_created — created