Report #14997
[tooling] GGUF Q4\_K\_M quantization degrades model quality significantly compared to FP16
Generate an importance matrix \(imatrix\) by running the imatrix example tool over calibration data from the target domain, then pass --imatrix imatrix.dat to the quantize tool. The resulting importance-weighted quantization keeps rounding error low on the weights that matter most, which can bring Q4\_K\_M quality close to Q5\_K\_M with no increase in file size.
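The two steps above can be sketched as a small Python wrapper that assembles the command lines. This is a minimal sketch under assumptions: the binary names `llama-imatrix` and `llama-quantize` are those used by recent llama.cpp builds (older builds shipped them as `imatrix` and `quantize`), and the file names are hypothetical placeholders. It defaults to a dry run that only prints the commands, so they can be checked before anything is executed.

```python
import shlex
import subprocess


def build_imatrix_cmd(model_f16, calibration_txt,
                      imatrix_out="imatrix.dat", binary="llama-imatrix"):
    """Command that collects importance statistics from calibration text.

    `binary` is an assumption; adjust it to match your build.
    """
    return [binary, "-m", model_f16, "-f", calibration_txt, "-o", imatrix_out]


def build_quantize_cmd(model_f16, output_gguf, imatrix="imatrix.dat",
                       quant_type="Q4_K_M", binary="llama-quantize"):
    """Command that quantizes the FP16 GGUF using the importance matrix."""
    return [binary, "--imatrix", imatrix, model_f16, output_gguf, quant_type]


def run_pipeline(model_f16, calibration_txt, output_gguf, dry_run=True):
    """Print both commands; execute them only when dry_run is False."""
    cmds = [
        build_imatrix_cmd(model_f16, calibration_txt),
        build_quantize_cmd(model_f16, output_gguf),
    ]
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails
            subprocess.run(cmd, check=True)
    return cmds


# Hypothetical file names, dry run only: nothing is executed here.
commands = run_pipeline("model-f16.gguf", "calibration.txt", "model-Q4_K_M.gguf")
```

Calling `run_pipeline(..., dry_run=False)` would run both tools in order and stop at the first failure.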
Journey Context:
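Standard quantization treats all weights equally, so rounding error lands on sensitive weights as readily as on unimportant ones, and perplexity rises. The imatrix tool runs calibration data through the model \(preferably text matching your use case, e.g., code for coding models\) and records, for each weight tensor, how strongly each input column is activated. During quantization these statistics weight the rounding error, so the weights that affect the output most are reproduced more accurately within the same bit budget. GGUFs produced this way are often labeled Q4\_K\_M\_Imat or similar. The gain matters most when memory is tight and every bit counts, as when running 70B models on 24GB VRAM \(at roughly 4 to 5 bits per weight a 70B model is still larger than 24GB, so such setups rely on partial CPU offload\). Without an imatrix, Q5 or Q6 may be needed for acceptable quality; with one, Q4 is often sufficient for production use.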
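The idea of weighting rounding error by importance can be illustrated with a toy example. This is a simplified sketch, not llama.cpp's actual code: it assumes importance is the mean squared activation per input column, uses a plain symmetric 4-bit grid, and picks the block scale by brute-force search. All function names and numbers are hypothetical.

```python
def importance_from_activations(activations):
    """Mean squared activation per column (simplified imatrix statistic)."""
    n = len(activations)
    return [sum(row[j] ** 2 for row in activations) / n
            for j in range(len(activations[0]))]


def dequantized(values, scale, qmin=-8, qmax=7):
    """Round each value to a 4-bit integer grid and map it back to floats."""
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def weighted_error(values, approx, importance):
    """Squared reconstruction error, weighted per element by importance."""
    return sum(w * (v - a) ** 2 for v, a, w in zip(values, approx, importance))


def best_scale(values, importance, steps=200):
    """Search candidate scales and keep the one with the lowest weighted error."""
    max_abs = max(abs(v) for v in values)
    # Candidates range from 0.5x to 1.5x of the naive scale max_abs / 7.
    candidates = [max_abs / 7 * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(values, dequantized(values, s), importance))


# One toy block of weights and two toy calibration activations (made-up numbers).
weights = [0.91, -0.12, 0.33, -0.58, 0.07, 0.45, -0.26, 0.74]
activations = [
    [0.1, 2.0, 0.3, 0.2, 2.5, 0.4, 1.8, 0.1],
    [0.2, 1.9, 0.1, 0.3, 2.2, 0.5, 1.6, 0.2],
]
importance = importance_from_activations(activations)
uniform = [1.0] * len(weights)

s_plain = best_scale(weights, uniform)     # every weight treated equally
s_imat = best_scale(weights, importance)   # error weighted by importance

# Score both choices with the importance-weighted error.
err_plain = weighted_error(weights, dequantized(weights, s_plain), importance)
err_imat = weighted_error(weights, dequantized(weights, s_imat), importance)
print(f"weighted error, uniform scale choice:    {err_plain:.6f}")
print(f"weighted error, importance scale choice: {err_imat:.6f}")
```

Because both scales are drawn from the same candidate set, the importance-aware choice can never have a higher importance-weighted error than the uniform one. How much it helps in practice depends on the model and the calibration data.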
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:53:26.721813+00:00— report_created — created