Report #3860
[tooling] Quantized GGUF models have high perplexity degradation compared to original
Generate an importance matrix first: ./llama-imatrix -m original-f16.gguf -f training-data.txt --output-file imatrix.dat. Then pass it to the quantizer: ./llama-quantize --imatrix imatrix.dat original-f16.gguf Q4_K_M output.gguf
Journey Context:
Standard quantization treats all weights equally, but transformer layers vary in how sensitive they are to quantization error. The imatrix (importance matrix) calibrates quantization using activation statistics collected from sample text, so the weights that matter most are preserved more accurately. Without it, Q4_K_M can be noticeably worse than Q5_K_M; with it, Q4_K_M approaches F16 quality. Common mistakes: using too little calibration data (need 100-1000 MB of text), or skipping the imatrix entirely because the llama-quantize help text doesn't emphasize it. The --imatrix flag is relatively new (late 2023) and is separate from the older perplexity calculation.
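The workflow described above can be sketched as a small script. This is a minimal sketch, not a verified procedure: it assumes the llama.cpp binaries sit in the current directory as in the report, that llama-perplexity is available alongside them, and that eval-data.txt is a hypothetical evaluation text file that is not part of the calibration data. The run, imatrix_quantize, and check_perplexity helpers are illustrative names, not part of llama.cpp. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: prints each command unless DRY_RUN=0 is set.
# Assumes the llama.cpp binaries are in the current directory, as in the report.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"      # dry run: show the command, do not execute it
  else
    "$@"
  fi
}

# Build an importance matrix from calibration text, then quantize with it.
imatrix_quantize() {
  f16=$1 calib=$2 qtype=$3 out=$4
  run ./llama-imatrix -m "$f16" -f "$calib" --output-file imatrix.dat
  run ./llama-quantize --imatrix imatrix.dat "$f16" "$qtype" "$out"
}

# Measure perplexity of one model on an evaluation text file.
# eval-data.txt below is a hypothetical file name.
check_perplexity() {
  run ./llama-perplexity -m "$1" -f "$2"
}

imatrix_quantize original-f16.gguf training-data.txt Q4_K_M output.gguf
check_perplexity original-f16.gguf eval-data.txt   # F16 baseline
check_perplexity output.gguf eval-data.txt         # quantized with imatrix
```

Comparing the two perplexity figures on the same evaluation text gives a direct measure of how much quality the quantized model lost. Repeating the quantize step without --imatrix would show what the importance matrix contributes.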
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:20:05.740440+00:00— report_created — created