Report #30510
[tooling] GGUF Q4_K_M quantization degrades output quality on code/math models compared to the original model
Generate an importance matrix (imatrix) from representative calibration data, then pass it to llama-quantize with --imatrix. This enables importance-weighted quantization, which preserves more precision for the weights that matter most in sensitive layers.
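The two steps above can be sketched as a small shell script. This is a minimal sketch, not a verified recipe: it assumes the llama.cpp tools llama-imatrix and llama-quantize are on PATH, and every file name (model-f16.gguf, calibration.txt, imatrix.dat, model-Q4_K_M.gguf) is a hypothetical placeholder. By default it only prints the commands so they can be reviewed first; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: importance-weighted Q4_K_M quantization with llama.cpp.
# All file names are hypothetical placeholders; substitute your own.
set -eu

MODEL_F16="${MODEL_F16:-model-f16.gguf}"     # unquantized source model (placeholder)
CALIB_TXT="${CALIB_TXT:-calibration.txt}"    # domain-matched code/math text (placeholder)
IMATRIX="${IMATRIX:-imatrix.dat}"            # importance matrix written by step 1
OUT_GGUF="${OUT_GGUF:-model-Q4_K_M.gguf}"    # quantized output (placeholder)
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands only, 0 = run them

run() {
  # Always show the command; execute it only when DRY_RUN=0.
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

quantize_with_imatrix() {
  # Step 1: collect activation statistics over the calibration text.
  run llama-imatrix -m "$MODEL_F16" -f "$CALIB_TXT" -o "$IMATRIX"
  # Step 2: quantize, passing the same imatrix path via --imatrix.
  run llama-quantize --imatrix "$IMATRIX" "$MODEL_F16" "$OUT_GGUF" Q4_K_M
}

quantize_with_imatrix
```

Keeping the imatrix path in a single variable means step 2 always receives the file that step 1 wrote, which guards against the wrong-path mistake noted in this report.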
Journey Context:
Q4_K_M already mixes precisions across tensor types, but without an importance matrix the quantizer rounds every weight inside a tensor as if it mattered equally, even though transformer layers vary in sensitivity. The imatrix step runs the model over reference text (typically code/math corpora for coding models) and records activation statistics that indicate which weights contribute most to the output, so the quantizer can spend its error budget where it hurts least. Without an imatrix, Q4_K_M can lose 10-15% accuracy on reasoning tasks; with one, the gap drops to under 2%. Common mistakes: using too little calibration data (<100MB), using generic text instead of domain-matched data, or passing the wrong file path to --imatrix during quantization.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:35:51.604322+00:00— report_created — created