Report #86689
[tooling] GGUF Q4_K_M model has high perplexity or degraded quality compared to original FP16
Quantize with `llama-quantize --imatrix calibration.dat ...`, using an importance matrix built from domain-specific calibration data, instead of quantizing without one.
Journey Context:
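Without an importance matrix, llama-quantize picks each block's scales from the weights alone, with no information about which weights matter most at inference time. At Q4_K_M that can discard subtle weight patterns the model relies on. The imatrix (importance matrix) records activation statistics gathered by running the model over calibration data (e.g., 100-1000 samples of your target text), and the quantizer uses them to keep rounding error low on the weights that have the most effect on the output. This can bring perplexity close to Q5_K_M while keeping the Q4_K_M file size, though the gain depends on the model and on how well the calibration data matches your workload. Many users skip this step because the .dat file has to be generated first with `llama-imatrix`, but the quality difference tends to be most noticeable for code and math models.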
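The two-step flow described above can be sketched as a small shell script. This is a minimal sketch, not a verified recipe: every file name (`model-f16.gguf`, `calibration.txt`, `model-Q4_K_M.gguf`, `eval.txt`) is a placeholder assumption, and the `run` helper and `DRY_RUN` switch are hypothetical conveniences added here so the commands can be inspected before anything is executed. The final `llama-perplexity` call is an optional check, assuming you have a held-out text file to measure against.

```shell
#!/bin/sh
# Sketch: imatrix-aware Q4_K_M quantization with the llama.cpp tools.
# All paths are placeholders (assumptions); override them via the environment.
MODEL_F16="${MODEL_F16:-model-f16.gguf}"    # original FP16 GGUF
CALIB_TXT="${CALIB_TXT:-calibration.txt}"   # domain-specific calibration text
IMATRIX="${IMATRIX:-calibration.dat}"       # importance matrix output
OUT_GGUF="${OUT_GGUF:-model-Q4_K_M.gguf}"   # quantized result
EVAL_TXT="${EVAL_TXT:-eval.txt}"            # held-out text for the perplexity check
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print commands, 0 = run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

quantize_with_imatrix() {
    # Step 1: collect activation statistics from the calibration text.
    run llama-imatrix -m "$MODEL_F16" -f "$CALIB_TXT" -o "$IMATRIX" || return 1
    # Step 2: quantize to Q4_K_M using the importance matrix.
    run llama-quantize --imatrix "$IMATRIX" "$MODEL_F16" "$OUT_GGUF" Q4_K_M || return 1
    # Step 3 (optional): measure perplexity of the result on held-out text.
    run llama-perplexity -m "$OUT_GGUF" -f "$EVAL_TXT"
}

quantize_with_imatrix
```

With the default `DRY_RUN=1` the script only prints the three commands. Setting `DRY_RUN=0` runs them, which assumes the llama.cpp binaries are on your `PATH`. Running the same perplexity command against a Q4_K_M file quantized without `--imatrix` gives a baseline for comparison.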
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:05:44.429757+00:00— report_created — created