Report #11619
[tooling] GGUF Q4_K_M quantization quality degradation on critical fine-tuned models
Generate an importance matrix (imatrix) by running llama-imatrix on representative calibration data, then pass --imatrix matrix.bin to llama-quantize. The resulting Q4_K_M has higher fidelity and can rival Q5_K_M at a smaller file size.
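The two steps above can be sketched as a small Python wrapper. This is an illustration under stated assumptions, not a verified recipe: the file names (model-f16.gguf, calibration.txt, model-Q4_K_M.gguf) are hypothetical, the helper function names are invented here, and the -m, -f and -o options of llama-imatrix should be checked against the --help output of your llama.cpp build.

```python
import shlex
import shutil
import subprocess


def imatrix_command(model, calibration, out="matrix.bin"):
    """Build the llama-imatrix call: model in, calibration text in, matrix out."""
    return ["llama-imatrix", "-m", model, "-f", calibration, "-o", out]


def quantize_command(src, dst, imatrix="matrix.bin", qtype="Q4_K_M"):
    """Build the llama-quantize call that consumes the importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix, src, dst, qtype]


def run_pipeline(model, calibration, dst):
    """Run both steps in order and stop on the first failure."""
    steps = [imatrix_command(model, calibration), quantize_command(model, dst)]
    for cmd in steps:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    print(shlex.join(imatrix_command("model-f16.gguf", "calibration.txt")))
    print(shlex.join(quantize_command("model-f16.gguf", "model-Q4_K_M.gguf")))
```

Printing the commands first makes it easy to review them before calling run_pipeline, which fits the caution that workarounds here are unverified.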
Journey Context:
Without an importance matrix, llama-quantize treats every weight inside a tensor as equally important when it rounds, so weights that drive critical expert layers or attention heads get the same rounding treatment as weights that barely affect the output. An importance matrix is computed by running calibration data through the model and accumulating activation statistics for each tensor (roughly, the mean squared activation feeding each weight column). The quantizer then uses these values to weight the rounding error, so precision is preserved where it affects the output most; the bit width of the format itself does not change. Without imatrix, Q4_K_M on code models or small fine-tunes often shows severe loss of specific fine-tuned knowledge. With imatrix, Q4_K_M can rival, and sometimes beat, naive Q5_K_M. Cost: requires ~100MB of calibration data plus the compute time to run it through the model.
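The idea of importance-weighted rounding can be shown with a toy example. This is a minimal sketch, not llama.cpp's actual K-quant code: it assumes a single block with one scale and signed 4-bit integers, and the function names, the candidate-scale search and the sample numbers are all invented for illustration.

```python
def accumulate_importance(activations):
    """Mean squared activation per input column over all calibration rows."""
    n = len(activations)
    cols = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) / n for j in range(cols)]


def weighted_error(weights, importance, scale, q):
    """Importance-weighted squared reconstruction error for one block."""
    return sum(i * (w - scale * qi) ** 2
               for w, qi, i in zip(weights, q, importance))


def quantize_block(weights, importance, levels=8, steps=64):
    """Pick the block scale that minimizes the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 0.0, [0] * len(weights)
    best = None
    for k in range(steps):
        # Candidate scales around the naive max_abs / levels choice.
        scale = (max_abs / levels) * (0.7 + 0.6 * k / (steps - 1))
        # Round to signed 4-bit integers in [-levels, levels - 1].
        q = [max(-levels, min(levels - 1, round(w / scale))) for w in weights]
        err = weighted_error(weights, importance, scale, q)
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]


if __name__ == "__main__":
    weights = [0.9, -0.31, 0.05, 0.62, -0.77, 0.12, 0.44, -0.08]
    # Made-up importance values: columns 1 and 5 see large activations.
    importance = [0.01, 4.0, 0.02, 0.01, 0.03, 3.5, 0.02, 0.01]
    naive = quantize_block(weights, [1.0] * len(weights))
    aware = quantize_block(weights, importance)
    print("naive:", weighted_error(weights, importance, *naive))
    print("aware:", weighted_error(weights, importance, *aware))
```

Both runs store the same number of bits per weight. Only the choice of scale changes, which is why the importance-aware result can never be worse on the weighted error than the naive one in this sketch.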
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:47:40.207784+00:00— report_created — created