Report #12062
[tooling] GGUF Q4_K_M quantized model has high perplexity degradation on custom domain data
Generate an importance matrix (imatrix) from a representative calibration dataset: ./llama-imatrix -m unquantized.gguf -f calibration.txt -o imatrix.dat -ngl 99. Then pass it to the quantizer with the --imatrix option, using the same unquantized model as input: ./llama-quantize --imatrix imatrix.dat unquantized.gguf output.gguf Q4_K_S. Prefer Q4_K_S with an imatrix over Q4_K_M without one for better quality at a similar size.
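The two steps above can be sketched as a small wrapper script. This is a minimal sketch, not a verified procedure: the file names and the LLAMA_BIN, RUN, and other environment variables are illustrative assumptions, and the matrix is passed through llama-quantize's --imatrix option. By default the script only prints the commands; it executes them when RUN=1 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical defaults; override via environment variables to match your setup.
LLAMA_BIN="${LLAMA_BIN:-.}"
SRC_MODEL="${SRC_MODEL:-unquantized.gguf}"
CALIB_TEXT="${CALIB_TEXT:-calibration.txt}"
IMATRIX="${IMATRIX:-imatrix.dat}"
OUT_MODEL="${OUT_MODEL:-output.gguf}"
QUANT_TYPE="${QUANT_TYPE:-Q4_K_S}"

# Print a command, and execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

quantize_with_imatrix() {
  # Step 1: collect activation statistics from the calibration text.
  run "$LLAMA_BIN/llama-imatrix" -m "$SRC_MODEL" -f "$CALIB_TEXT" -o "$IMATRIX" -ngl 99
  # Step 2: quantize the same unquantized model, guided by the imatrix.
  run "$LLAMA_BIN/llama-quantize" --imatrix "$IMATRIX" "$SRC_MODEL" "$OUT_MODEL" "$QUANT_TYPE"
}

quantize_with_imatrix
```

Running the script as-is shows the two commands for review. Setting RUN=1 (and, for example, CALIB_TEXT=my_domain.txt) executes them in order and stops at the first failure because of set -e.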
Journey Context:
Without an importance matrix, GGUF quantization rounds each block of weights to minimize plain weight error, treating every weight in a tensor as equally important even though some affect the model's output far more than others. imatrix records activation statistics per input channel of each weight tensor from calibration data, so the quantizer can weight rounding error toward the weights that matter most and spend its effective precision there. Most users skip this step because it requires the unquantized source model and a domain-representative text file (1-10MB), but it typically reduces the perplexity gap vs FP16 by 30-50% compared to quantization without an imatrix. The tradeoff is roughly 1 hour of preprocessing time and the requirement that the calibration text statistically resemble your inference distribution.
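To check the perplexity-gap claim on your own data, one option is to measure perplexity for the FP16 model, a quantization made without an imatrix, and the imatrix quantization, then compare the gaps. The helpers below are a rough sketch under stated assumptions: heldout.txt is a hypothetical held-out file (kept separate from the calibration text), the function names are invented for illustration, and the parser assumes llama-perplexity ends with a line of the form "Final estimate: PPL = <value> +/- <error>".

```shell
#!/bin/sh
set -eu

# Extract the final perplexity from llama-perplexity output read on stdin.
# Assumes a summary line like "Final estimate: PPL = 6.1234 +/- 0.04321".
extract_ppl() {
  sed -n 's/.*Final estimate: PPL = \([0-9.][0-9.]*\).*/\1/p' | tail -n 1
}

# Run llama-perplexity on one model file and print its final PPL.
# HELDOUT_TEXT is a hypothetical held-out file, distinct from calibration.txt.
measure_ppl() {
  ./llama-perplexity -m "$1" -f "${HELDOUT_TEXT:-heldout.txt}" -ngl 99 2>&1 | extract_ppl
}

# Percentage of the no-imatrix perplexity gap (relative to the FP16 baseline)
# that the imatrix quantization closes. Prints "n/a" if there is no gap.
gap_reduction() {
  awk -v base="$1" -v naive="$2" -v imat="$3" 'BEGIN {
    if (naive <= base) { print "n/a"; exit }
    printf "%.1f\n", 100 * (1 - (imat - base) / (naive - base))
  }'
}

# Example usage (not executed here):
#   base=$(measure_ppl unquantized.gguf)
#   naive=$(measure_ppl output-no-imatrix.gguf)
#   imat=$(measure_ppl output.gguf)
#   gap_reduction "$base" "$naive" "$imat"
```

A positive result is the percentage of the gap closed by the imatrix quantization. A negative result means it scored worse than the plain quantization on the held-out text, which may indicate that the calibration data does not match the inference distribution.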
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:56:18.290145+00:00— report_created — created