Report #62814
[tooling] Quantized GGUF models show significant quality degradation (perplexity increase) at Q4_K_M or lower bitrates
Generate an importance matrix (imatrix) by running llama.cpp's imatrix tool on a calibration dataset (e.g., Wikitext or domain-specific text), then pass the resulting file to llama-quantize via --imatrix imatrix.dat. This data-aware quantization reproduces the most influential weights more accurately, which can let aggressive Q3_K_M or Q4_K_S quants approach the quality of a naive Q5_K_M quant.
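The two steps above can be sketched as a small Python wrapper. This is a minimal sketch, not a definitive recipe: it assumes a recent llama.cpp build where the binaries are named llama-imatrix and llama-quantize and are on PATH (older builds used different names), and the file names model-f16.gguf, calibration.txt and model-Q4_K_S.gguf are hypothetical placeholders.

```python
import subprocess


def build_imatrix_cmd(model, calibration, out="imatrix.dat"):
    """Command that computes the importance matrix from calibration text."""
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", out]


def build_quantize_cmd(src, dst, qtype="Q4_K_S", imatrix="imatrix.dat"):
    """Command that quantizes src to dst, guided by the importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix, src, dst, qtype]


def run_pipeline(model, calibration, dst, qtype="Q4_K_S", dry_run=True):
    """Print both commands; execute them only when dry_run is False."""
    commands = [
        build_imatrix_cmd(model, calibration),
        build_quantize_cmd(model, dst, qtype),
    ]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the tool fails
    return commands


if __name__ == "__main__":
    # Hypothetical file names; dry_run=True only prints the commands.
    run_pipeline("model-f16.gguf", "calibration.txt", "model-Q4_K_S.gguf")
```

Calling run_pipeline(..., dry_run=False) would actually invoke the tools. Comparing perplexity of the result against a quant made without --imatrix is a reasonable way to check that the calibration data helped.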
Journey Context:
Standard quantization treats all weights equally, but LLMs have 'sensitive' weights whose precision has an outsized impact on output quality. The imatrix (importance matrix) is computed by running calibration data through the model and accumulating activation statistics for each weight tensor: weights that are consistently multiplied by large activations influence the output the most. When quantizing with --imatrix, the quantizer does not change the nominal bit width of the chosen quant type; instead it weights the rounding error by these importance values, so the quantized values reproduce the influential weights more accurately at the expense of less important ones. This matters most for small quants (Q3_K, Q4_K_S) and for Mixtral/MoE models, where standard quantization can degrade quality badly. For best results, the calibration dataset should resemble the target domain.
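The idea of importance-weighted rounding can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not llama.cpp's actual k-quant code: it uses a single symmetric 4-bit scale for one small block, picks the scale by grid search, and the weights and importance values are made up for illustration.

```python
def quantize(weights, scale, qmax=7):
    """Round to the nearest 4-bit level in [-qmax-1, qmax], then dequantize."""
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, dequant, importance):
    """Squared reconstruction error, weighted per element by importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequant, importance))


def best_scale(weights, importance, qmax=7, steps=200):
    """Grid-search the scale that minimises the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return 1.0  # all-zero block: any scale reproduces it exactly
    best_err, best = None, None
    for k in range(1, steps + 1):
        scale = (max_abs / qmax) * (0.5 + k / steps)  # candidates around max_abs/qmax
        err = weighted_error(weights, quantize(weights, scale, qmax), importance)
        if best_err is None or err < best_err:
            best_err, best = err, scale
    return best


if __name__ == "__main__":
    # Made-up block: the fourth weight sees much larger activations.
    weights = [0.11, -0.52, 0.93, 0.27, -0.04, 0.68, -0.81, 0.35]
    importance = [0.1, 0.1, 0.1, 9.0, 0.1, 0.1, 0.1, 0.1]
    uniform = [1.0] * len(weights)

    naive = best_scale(weights, uniform)        # treats all weights equally
    guided = best_scale(weights, importance)    # imatrix-style weighting

    for name, s in (("naive", naive), ("guided", guided)):
        err = weighted_error(weights, quantize(weights, s), importance)
        print(name, "scale:", s, "importance-weighted error:", err)
```

Both variants spend the same number of bits per weight. The guided scale can only match or lower the importance-weighted error, because both searches draw from the same candidate scales and the guided one minimises that error directly.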
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:55:06.568873+00:00— report_created — created