Report #70919
[tooling] GGUF Q4\_K\_M model quality degradation, seeking better quantization method
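Generate an imatrix \(importance matrix\) from calibration data with llama.cpp's llama-imatrix tool, then pass the resulting file to llama-quantize via --imatrix when producing the quantized GGUF. convert\_hf\_to\_gguf.py handles the Hugging Face to GGUF conversion step and does not take an --imatrix option. Reported gain: 15-30% better perplexity at the same file size.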
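A minimal sketch of the two-step workflow, assuming a recent llama.cpp build whose `llama-imatrix` and `llama-quantize` binaries are on PATH \(older builds named them `imatrix` and `quantize`\). The helper name `imatrix_commands` and all file names are hypothetical placeholders; the sketch only builds the command lines and leaves running them to the caller.

```python
import shlex


def imatrix_commands(f16_gguf, calibration_txt, out_gguf,
                     quant_type="Q4_K_M", imatrix_path="imatrix.dat"):
    """Return two argv lists: compute the imatrix, then quantize with it.

    All paths are placeholders; binary names assume a recent llama.cpp build.
    """
    # Step 1: run calibration text through the unquantized model.
    compute = ["llama-imatrix", "-m", f16_gguf,
               "-f", calibration_txt, "-o", imatrix_path]
    # Step 2: quantize, letting the imatrix guide the rounding.
    quantize = ["llama-quantize", "--imatrix", imatrix_path,
                f16_gguf, out_gguf, quant_type]
    return [compute, quantize]


if __name__ == "__main__":
    steps = imatrix_commands("model-f16.gguf", "calibration.txt",
                             "model-Q4_K_M.gguf")
    for argv in steps:
        # Print for inspection; to execute a step, use
        # subprocess.run(argv, check=True) after reviewing it.
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run before the paths and binary names have been checked against the local install.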
Journey Context:
Standard GGUF quantization minimizes rounding error uniformly across weights, but weights differ in how much they affect the model's output. The imatrix is computed by running calibration data through the model and accumulating an importance value for the weights \(based on activation magnitudes\). The quantizer then weights its rounding error by these values, so the weights that matter most are preserved more accurately at the same bit width. Tradeoff: requires ~100MB-1GB of representative calibration text and an extra compute pass before quantization. Most users skip this and accept worse quality at Q4. Essential for coding models at Q4\_K\_S.
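The idea can be illustrated with a toy example. This is a simplified sketch, not llama.cpp's actual quantization code: it assumes importance is the mean squared activation per input column and that the quantizer picks one scale per block by searching a small candidate grid. All function names and data values are made up for illustration.

```python
def column_importance(activations):
    """Mean squared activation per input column (toy stand-in for an imatrix)."""
    n = len(activations)
    return [sum(row[j] ** 2 for row in activations) / n
            for j in range(len(activations[0]))]


def dequant_error(weights, scale, imp, qmax):
    """Importance-weighted squared error after rounding to the integer grid."""
    err = 0.0
    for w, s in zip(weights, imp):
        q = max(-qmax - 1, min(qmax, round(w / scale)))  # clamp to signed range
        err += s * (w - q * scale) ** 2
    return err


def best_scale(weights, imp, bits=4, steps=50):
    """Search candidate scales and keep the one with the lowest weighted error."""
    qmax = 2 ** (bits - 1) - 1  # 7 for symmetric 4-bit
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return 1.0
    # Candidates span 0.5x to 1.5x of the naive max-based scale.
    candidates = [amax / qmax * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates, key=lambda s: dequant_error(weights, s, imp, qmax))


# Made-up block of 8 weights and two calibration activation vectors.
weights = [0.9, -0.05, 0.31, 0.12, -0.47, 0.08, 0.66, -0.2]
acts = [[0.1, 2.0, 0.1, 0.1, 0.1, 1.5, 0.1, 0.1],
        [0.2, 1.8, 0.1, 0.2, 0.1, 1.7, 0.1, 0.2]]

imp = column_importance(acts)
uniform = [1.0] * len(weights)
s_plain = best_scale(weights, uniform)  # treats every weight equally
s_imp = best_scale(weights, imp)        # favors heavily activated columns

print("uniform scale:", s_plain, "weighted error:",
      dequant_error(weights, s_plain, imp, 7))
print("imatrix scale:", s_imp, "weighted error:",
      dequant_error(weights, s_imp, imp, 7))
```

Both searches use the same candidate grid, so the importance-aware choice can never do worse on the weighted error. Any gain comes from trading accuracy on rarely activated columns for accuracy on heavily activated ones, at an unchanged bit width.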
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:37:11.834050+00:00— report_created — created